Eric Lee / smarc-fsl-linux-kernel

08 Oct, 2016

3 commits

81243eacf cred: simpler, 1D supplementary groups ... Browse Code »

Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is
Cc: Vasily Kulikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2016-10-08 09:46:30 +0800
75ba1d07f seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char ... Browse Code »

Allow some seq_puts removals by taking a string instead of a single
char.

[akpm@linux-foundation.org: update vmstat_show(), per Joe]
Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
Signed-off-by: Joe Perches
Cc: Joe Perches
Cc: Andi Kleen
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2016-10-08 09:46:30 +0800
f7a5f132b proc: faster /proc/*/status ... Browse Code »

top(1) opens the following files for every PID:

/proc/*/stat
/proc/*/statm
/proc/*/status

This patch switches /proc/*/status away from seq_printf().
The result is 13.5% speedup.

Benchmark is open("/proc/self/status")+read+close 1.000.000 million times.

BEFORE
$ perf stat -r 10 taskset -c 3 ./proc-self-status

Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

10748.474301 task-clock (msec) # 0.954 CPUs utilized ( +- 0.91% )
12 context-switches # 0.001 K/sec ( +- 1.09% )
1 cpu-migrations # 0.000 K/sec
104 page-faults # 0.010 K/sec ( +- 0.45% )
37,424,127,876 cycles # 3.482 GHz ( +- 0.04% )
8,453,010,029 stalled-cycles-frontend # 22.59% frontend cycles idle ( +- 0.12% )
3,747,609,427 stalled-cycles-backend # 10.01% backend cycles idle ( +- 0.68% )
65,632,764,147 instructions # 1.75 insn per cycle
# 0.13 stalled cycles per insn ( +- 0.00% )
13,981,324,775 branches # 1300.773 M/sec ( +- 0.00% )
138,967,110 branch-misses # 0.99% of all branches ( +- 0.18% )

11.263885428 seconds time elapsed ( +- 0.04% )
^^^^^^^^^^^^

AFTER
$ perf stat -r 10 taskset -c 3 ./proc-self-status

Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

9010.521776 task-clock (msec) # 0.925 CPUs utilized ( +- 1.54% )
11 context-switches # 0.001 K/sec ( +- 1.54% )
1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
103 page-faults # 0.011 K/sec ( +- 0.60% )
32,352,310,603 cycles # 3.591 GHz ( +- 0.07% )
7,849,199,578 stalled-cycles-frontend # 24.26% frontend cycles idle ( +- 0.27% )
3,269,738,842 stalled-cycles-backend # 10.11% backend cycles idle ( +- 0.73% )
56,012,163,567 instructions # 1.73 insn per cycle
# 0.14 stalled cycles per insn ( +- 0.00% )
11,735,778,795 branches # 1302.453 M/sec ( +- 0.00% )
98,084,459 branch-misses # 0.84% of all branches ( +- 0.28% )

9.741247736 seconds time elapsed ( +- 0.07% )
^^^^^^^^^^^

Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by
Signed-off-by: Alexey Dobriyan
Cc: Joe Perches
Cc: Andi Kleen
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2016-10-08 09:46:30 +0800

21 May, 2016

1 commit

3e42979e6 procfs: expose umask in /proc/<PID>/status ... Browse Code »

It's not possible to read the process umask without also modifying it,
which is what umask(2) does. A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc//status. It contains the
file mode creation mask (umask) in octal. It is only shown for tasks
which have task->fs.

This patch is adapted from one originally written by Pierre Carrier.

The use case is that we have endless trouble with people setting weird
umask() values (usually on the grounds of "security"), and then
everything breaking. I'm on the hook to fix these. We'd like to add
debugging to our program so we can dump out the umask in debug reports.

Previous versions of the patch used a syscall so you could only read
your own umask. That's all I need. However there was quite a lot of
push-back from those, so this new version exports it in /proc.

See:
https://lkml.org/lkml/2016/4/13/704 [umask2]
https://lkml.org/lkml/2016/4/13/487 [getumask]

Signed-off-by: Richard W.M. Jones
Acked-by: Konstantin Khlebnikov
Acked-by: Jerome Marchand
Acked-by: Kees Cook
Cc: "Theodore Ts'o"
Cc: Michal Hocko
Cc: Pierre Carrier
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Richard W.M. Jones
2016-05-21 08:58:30 +0800

21 Jan, 2016

1 commit

caaee6234 ptrace: use fsuid, fsgid, effective creds for fs access checks ... Browse Code »

By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.

To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.

The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.

While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.

In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:

/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret

Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn
Acked-by: Kees Cook
Cc: Casey Schaufler
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: James Morris
Cc: "Serge E. Hallyn"
Cc: Andy Shevchenko
Cc: Andy Lutomirski
Cc: Al Viro
Cc: "Eric W. Biederman"
Cc: Willy Tarreau
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jann Horn
2016-01-21 09:09:18 +0800

07 Nov, 2015

1 commit

3a49f3d2a fs/proc/array.c: set overflow flag in case of error ... Browse Code »

For now in task_name() we ignore the return code of string_escape_str()
call. This is not good if buffer suddenly becomes not big enough. Do the
proper error handling there.

Signed-off-by: Andy Shevchenko
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2015-11-07 09:50:42 +0800

01 Oct, 2015

1 commit

b2f73922d fs/proc, core/debug: Don't expose absolute kernel addresses via wchan ... Browse Code »

So the /proc/PID/stat 'wchan' field (the 30th field, which contains
the absolute kernel address of the kernel function a task is blocked in)
leaks absolute kernel addresses to unprivileged user-space:

seq_put_decimal_ull(m, ' ', wchan);

The absolute address might also leak via /proc/PID/wchan as well, if
KALLSYMS is turned off or if the symbol lookup fails for some reason:

static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
unsigned long wchan;
char symname[KSYM_NAME_LEN];

wchan = get_wchan(task);

if (lookup_symbol_name(wchan, symname) < 0) {
if (!ptrace_may_access(task, PTRACE_MODE_READ))
return 0;
seq_printf(m, "%lu", wchan);
} else {
seq_printf(m, "%s", symname);
}

return 0;
}

This isn't ideal, because for example it trivially leaks the KASLR offset
to any local attacker:

fomalhaut:~> printf "%016lx\n" $(cat /proc/$$/stat | cut -d' ' -f35)
ffffffff8123b380

Most real-life uses of wchan are symbolic:

ps -eo pid:10,tid:10,wchan:30,comm

and procps uses /proc/PID/wchan, not the absolute address in /proc/PID/stat:

triton:~/tip> strace -f ps -eo pid:10,tid:10,wchan:30,comm 2>&1 | grep wchan | tail -1
open("/proc/30833/wchan", O_RDONLY) = 6

There's one compatibility quirk here: procps relies on whether the
absolute value is non-zero - and we can provide that functionality
by outputing "0" or "1" depending on whether the task is blocked
(whether there's a wchan address).

These days there appears to be very little legitimate reason
user-space would be interested in the absolute address. The
absolute address is mostly historic: from the days when we
didn't have kallsyms and user-space procps had to do the
decoding itself via the System.map.

So this patch sets all numeric output to "0" or "1" and keeps only
symbolic output, in /proc/PID/wchan.

( The absolute sleep address can generally still be profiled via
perf, by tasks with sufficient privileges. )

Reviewed-by: Thomas Gleixner
Acked-by: Kees Cook
Acked-by: Linus Torvalds
Cc:
Cc: Al Viro
Cc: Alexander Potapenko
Cc: Andrey Konovalov
Cc: Andrey Ryabinin
Cc: Andy Lutomirski
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Denys Vlasenko
Cc: Dmitry Vyukov
Cc: Kostya Serebryany
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Peter Zijlstra
Cc: Sasha Levin
Cc: kasan-dev
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930135917.GA3285@gmail.com
Signed-off-by: Ingo Molnar

Ingo Molnar
2015-10-01 18:55:34 +0800

05 Sep, 2015

1 commit

58319057b capabilities: ambient capabilities ... Browse Code »

Credit where credit is due: this idea comes from Christoph Lameter with
a lot of valuable input from Serge Hallyn. This patch is heavily based
on Christoph's patch.

===== The status quo =====

On Linux, there are a number of capabilities defined by the kernel. To
perform various privileged tasks, processes can wield capabilities that
they hold.

Each task has four capability masks: effective (pE), permitted (pP),
inheritable (pI), and a bounding set (X). When the kernel checks for a
capability, it checks pE. The other capability masks serve to modify
what capabilities can be in pE.

Any task can remove capabilities from pE, pP, or pI at any time. If a
task has a capability in pP, it can add that capability to pE and/or pI.
If a task has CAP_SETPCAP, then it can add any capability to pI, and it
can remove capabilities from X.

Tasks are not the only things that can have capabilities; files can also
have capabilities. A file can have no capabilty information at all [1].
If a file has capability information, then it has a permitted mask (fP)
and an inheritable mask (fI) as well as a single effective bit (fE) [2].
File capabilities modify the capabilities of tasks that execve(2) them.

A task that successfully calls execve has its capabilities modified for
the file ultimately being excecuted (i.e. the binary itself if that
binary is ELF or for the interpreter if the binary is a script.) [3] In
the capability evolution rules, for each mask Z, pZ represents the old
value and pZ' represents the new value. The rules are:

pP' = (X & fP) | (pI & fI)
pI' = pI
pE' = (fE ? pP' : 0)
X is unchanged

For setuid binaries, fP, fI, and fE are modified by a moderately
complicated set of rules that emulate POSIX behavior. Similarly, if
euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
(primary, fP and fI usually end up being the full set). For nonroot
users executing binaries with neither setuid nor file caps, fI and fP
are empty and fE is false.

As an extra complication, if you execute a process as nonroot and fE is
set, then the "secure exec" rules are in effect: AT_SECURE gets set,
LD_PRELOAD doesn't work, etc.

This is rather messy. We've learned that making any changes is
dangerous, though: if a new kernel version allows an unprivileged
program to change its security state in a way that persists cross
execution of a setuid program or a program with file caps, this
persistent state is surprisingly likely to allow setuid or file-capped
programs to be exploited for privilege escalation.

===== The problem =====

Capability inheritance is basically useless.

If you aren't root and you execute an ordinary binary, fI is zero, so
your capabilities have no effect whatsoever on pP'. This means that you
can't usefully execute a helper process or a shell command with elevated
capabilities if you aren't root.

On current kernels, you can sort of work around this by setting fI to
the full set for most or all non-setuid executable files. This causes
pP' = pI for nonroot, and inheritance works. No one does this because
it's a PITA and it isn't even supported on most filesystems.

If you try this, you'll discover that every nonroot program ends up with
secure exec rules, breaking many things.

This is a problem that has bitten many people who have tried to use
capabilities for anything useful.

===== The proposed change =====

This patch adds a fifth capability mask called the ambient mask (pA).
pA does what most people expect pI to do.

pA obeys the invariant that no bit can ever be set in pA if it is not
set in both pP and pI. Dropping a bit from pP or pI drops that bit from
pA. This ensures that existing programs that try to drop capabilities
still do so, with a complication. Because capability inheritance is so
broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
then calling execve effectively drops capabilities. Therefore,
setresuid from root to nonroot conditionally clears pA unless
SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can
re-add bits to pA afterwards.

The capability evolution rules are changed:

pA' = (file caps or setuid or setgid ? 0 : pA)
pP' = (X & fP) | (pI & fI) | pA'
pI' = pI
pE' = (fE ? pP' : pA')
X is unchanged

If you are nonroot but you have a capability, you can add it to pA. If
you do so, your children get that capability in pA, pP, and pE. For
example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
automatically bind low-numbered ports. Hallelujah!

Unprivileged users can create user namespaces, map themselves to a
nonzero uid, and create both privileged (relative to their namespace)
and unprivileged process trees. This is currently more or less
impossible. Hallelujah!

You cannot use pA to try to subvert a setuid, setgid, or file-capped
program: if you execute any such program, pA gets cleared and the
resulting evolution rules are unchanged by this patch.

Users with nonzero pA are unlikely to unintentionally leak that
capability. If they run programs that try to drop privileges, dropping
privileges will still work.

It's worth noting that the degree of paranoia in this patch could
possibly be reduced without causing serious problems. Specifically, if
we allowed pA to persist across executing non-pA-aware setuid binaries
and across setresuid, then, naively, the only capabilities that could
leak as a result would be the capabilities in pA, and any attacker
*already* has those capabilities. This would make me nervous, though --
setuid binaries that tried to privilege-separate might fail to do so,
and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
unexpected side effects. (Whether these unexpected side effects would
be exploitable is an open question.) I've therefore taken the more
paranoid route. We can revisit this later.

An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
ambient capabilities. I think that this would be annoying and would
make granting otherwise unprivileged users minor ambient capabilities
(CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
it is with this patch.

===== Footnotes =====

[1] Files that are missing the "security.capability" xattr or that have
unrecognized values for that xattr end up with has_cap set to false.
The code that does that appears to be complicated for no good reason.

[2] The libcap capability mask parsers and formatters are dangerously
misleading and the documentation is flat-out wrong. fE is *not* a mask;
it's a single bit. This has probably confused every single person who
has tried to use file capabilities.

[3] Linux very confusingly processes both the script and the interpreter
if applicable, for reasons that elude me. The results from thinking
about a script's file capabilities and/or setuid bits are mostly
discarded.

Preliminary userspace code is here, but it needs updating:
https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2

Here is a test program that can be used to verify the functionality
(from Christoph):

/*
* Test program for the ambient capabilities. This program spawns a shell
* that allows running processes with a defined set of capabilities.
*
* (C) 2015 Christoph Lameter
* Released under: GPL v3 or later.
*
*
* Compile using:
*
* gcc -o ambient_test ambient_test.o -lcap-ng
*
* This program must have the following capabilities to run properly:
* Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
*
* A command to equip the binary with the right caps is:
*
* setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
*
*
* To get a shell with additional caps that can be inherited by other processes:
*
* ./ambient_test /bin/bash
*
*
* Verifying that it works:
*
* From the bash spawed by ambient_test run
*
* cat /proc/$$/status
*
* and have a look at the capabilities.
*/

#include
#include
#include
#include
#include
#include

/*
* Definitions from the kernel header files. These are going to be removed
* when the /usr/include files have these defined.
*/
#define PR_CAP_AMBIENT 47
#define PR_CAP_AMBIENT_IS_SET 1
#define PR_CAP_AMBIENT_RAISE 2
#define PR_CAP_AMBIENT_LOWER 3
#define PR_CAP_AMBIENT_CLEAR_ALL 4

static void set_ambient_cap(int cap)
{
int rc;

capng_get_caps_process();
rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
if (rc) {
printf("Cannot add inheritable cap\n");
exit(2);
}
capng_apply(CAPNG_SELECT_CAPS);

/* Note the two 0s at the end. Kernel checks for these */
if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
perror("Cannot set cap");
exit(1);
}
}

int main(int argc, char **argv)
{
int rc;

set_ambient_cap(CAP_NET_RAW);
set_ambient_cap(CAP_NET_ADMIN);
set_ambient_cap(CAP_SYS_NICE);

printf("Ambient_test forking shell\n");
if (execv(argv[1], argv + 1))
perror("Cannot exec");

return 0;
}

Signed-off-by: Christoph Lameter # Original author
Signed-off-by: Andy Lutomirski
Acked-by: Serge E. Hallyn
Acked-by: Kees Cook
Cc: Jonathan Corbet
Cc: Aaron Jones
Cc: Ted Ts'o
Cc: Andrew G. Morgan
Cc: Mimi Zohar
Cc: Austin S Hemmelgarn
Cc: Markku Savela
Cc: Jarkko Sakkinen
Cc: Michael Kerrisk
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Lutomirski
2015-09-05 07:54:41 +0800

26 Jun, 2015

1 commit

2e13ba54a fs, proc: introduce CONFIG_PROC_CHILDREN ... Browse Code »

Commit 818411616baf ("fs, proc: introduce /proc//task//children
entry") introduced the children entry for checkpoint restore and the
file is only available on kernels configured with CONFIG_EXPERT and
CONFIG_CHECKPOINT_RESTORE.

This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.

However, the children proc file is useful outside of checkpoint restore.
I would like to use it in rkt. The rkt process exec() another program
it does not control, and that other program will fork()+exec() a child
process. I would like to find the pid of the child process from an
external tool without iterating in /proc over all processes to find
which one has a parent pid equal to rkt.

This commit introduces CONFIG_PROC_CHILDREN and makes
CONFIG_CHECKPOINT_RESTORE select it. This allows enabling
/proc//task//children without needing to enable
CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.

Alban tested that /proc//task//children is present when the
kernel is configured with CONFIG_PROC_CHILDREN=y but without
CONFIG_CHECKPOINT_RESTORE

Signed-off-by: Iago López Galeiras
Tested-by: Alban Crequy
Reviewed-by: Cyrill Gorcunov
Cc: Oleg Nesterov
Cc: Kees Cook
Cc: Pavel Emelyanov
Cc: Serge Hallyn
Cc: KAMEZAWA Hiroyuki
Cc: Alexander Viro
Cc: Andy Lutomirski
Cc: Djalal Harouni
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Iago López Galeiras
2015-06-26 08:00:37 +0800

25 Jun, 2015

1 commit

f51c0eaee procfs: treat parked tasks as sleeping for task state ... Browse Code »

Allowing watchdog threads to be parked means that we now have the
opportunity of actually seeing persistent parked threads in the output
of /proc//stat and /proc//status. The existing code reported
such threads as "Running", which is kind-of true if you think of the
case where we park them as part of taking cpus offline. But if we allow
parking them indefinitely, "Running" is pretty misleading, so we report
them as "Sleeping" instead.

We could simply report them with a new string, "Parked", but it feels
like it's a bit risky for userspace to see unexpected new values; the
output is already documented in Documentation/filesystems/proc.txt, and
it seems like a mistake to change that lightly.

The scheduler does report parked tasks with a "P" in debugging output
from sched_show_task() or dump_cpu_task(), but that's a different API.
Similarly, the trace_ctxwake_* routines report a "P" for parked tasks,
but again, different API.

This change seemed slightly cleaner than updating the task_state_array
to have additional rows. TASK_DEAD should be subsumed by the exit_state
bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
reasonably be reported as "Running" (as it is now). Only TASK_PARKED
shows up with unreasonable output here.

Signed-off-by: Chris Metcalf
Cc: Don Zickus
Cc: Ingo Molnar
Cc: Ulrich Obergfell
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Frederic Weisbecker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chris Metcalf
2015-06-25 08:49:40 +0800

16 Apr, 2015

3 commits

25ce31916 proc: remove use of seq_printf return value ... Browse Code »

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
seq_has_overflowed() and make public")

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-04-16 07:35:25 +0800
41416f233 lib/string_helpers.c: change semantics of string_escape_mem ... Browse Code »

The current semantics of string_escape_mem are inadequate for one of its
current users, vsnprintf(). If that is to honour its contract, it must
know how much space would be needed for the entire escaped buffer, and
string_escape_mem provides no way of obtaining that (short of allocating a
large enough buffer (~4 times input string) to let it play with, and
that's definitely a big no-no inside vsnprintf).

So change the semantics for string_escape_mem to be more snprintf-like:
Return the size of the output that would be generated if the destination
buffer was big enough, but of course still only write to the part of dst
it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
It is then up to the caller to detect whether output was truncated and to
append a '\0' if desired. Also, we must output partial escape sequences,
otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever
they previously contained.

This also fixes a bug in the escaped_string() helper function, which used
to unconditionally pass a length of "end-buf" to string_escape_mem();
since the latter doesn't check osz for being insanely large, it would
happily write to dst. For example, kasprintf(GFP_KERNEL, "something and
then %pE", ...); is an easy way to trigger an oops.

In test-string_helpers.c, the -ENOMEM test is replaced with testing for
getting the expected return value even if the buffer is too small. We
also ensure that nothing is written (by relying on a NULL pointer deref)
if the output size is 0 by passing NULL - this has to work for
kasprintf("%pE") to work.

In net/sunrpc/cache.c, I think qword_add still has the same semantics.
Someone should definitely double-check this.

In fs/proc/array.c, I made the minimum possible change, but longer-term it
should stop poking around in seq_file internals.

[andriy.shevchenko@linux.intel.com: simplify qword_add]
[andriy.shevchenko@linux.intel.com: add missed curly braces]
Signed-off-by: Rasmus Villemoes
Acked-by: Andy Shevchenko
Signed-off-by: Andy Shevchenko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rasmus Villemoes
2015-04-16 07:35:24 +0800
e4bc33245 /proc/PID/status: show all sets of pid according to ns ... Browse Code »

If some issues occurred inside a container guest, host user could not know
which process is in trouble just by guest pid: the users of container
guest only knew the pid inside containers. This will bring obstacle for
trouble shooting.

This patch adds four fields: NStgid, NSpid, NSpgid and NSsid:

a) In init_pid_ns, nothing changed;

b) In one pidns, will tell the pid inside containers:
NStgid: 21776 5 1
NSpid: 21776 5 1
NSpgid: 21776 5 1
NSsid: 21729 1 0
** Process id is 21776 in level 0, 5 in level 1, 1 in level 2.

c) If pidns is nested, it depends on which pidns are you in.
NStgid: 5 1
NSpid: 5 1
NSpgid: 5 1
NSsid: 1 0
** Views from level 1

[akpm@linux-foundation.org: add CONFIG_PID_NS ifdef]
Signed-off-by: Chen Hanxiao
Acked-by: Serge Hallyn
Acked-by: "Eric W. Biederman"
Tested-by: Serge Hallyn
Tested-by: Nathan Scott
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen Hanxiao
2015-04-16 07:35:22 +0800

14 Feb, 2015

1 commit

a0c2e07d6 proc: use %*pb[l] to print bitmaps including cpumasks and nodemasks ... Browse Code »

printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.

Signed-off-by: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tejun Heo
2015-02-14 13:21:38 +0800

13 Feb, 2015

1 commit

edc924e02 fs/proc/array.c: convert to use string_escape_str() ... Browse Code »

Instead of custom approach let's use string_escape_str() to escape a given
string (task_name in this case).

Signed-off-by: Andy Shevchenko
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2015-02-13 10:54:12 +0800

11 Dec, 2014

4 commits

abdba6e9e proc: task_state: ptrace_parent() doesn't need pid_alive() check ... Browse Code »

p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check. Other callers already use it
directly without additional checks.

Note: with or without this patch ptrace_parent() can return the pointer to
the freed task, this will be explained/fixed later.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman" ,
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:09 +0800
b0fafc111 proc: task_state: move the main seq_printf() outside of rcu_read_lock() ... Browse Code »

task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
tgid/ngid and drop rcu lock.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman" ,
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Reviewed-by: Paul E. McKenney
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:09 +0800
0f4a0d53f proc: task_state: deuglify the max_fds calculation ... Browse Code »

1. The usage of fdt looks very ugly, it can't be NULL if ->files is
not NULL. We can use "unsigned int max_fds" instead.

2. This also allows to move seq_printf(max_fds) outside of task_lock()
and join it with the previous seq_printf(). See also the next patch.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman" ,
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:09 +0800
4af1036df proc: task_state: read cred->group_info outside of task_lock() ... Browse Code »

task_state() reads cred->group_info under task_lock() because a long ago
it was task_struct->group_info and it was actually protected by
task->alloc_lock. Today this task_unlock() after rcu_read_unlock() just
adds the confusion, move task_unlock() up.

Signed-off-by: Oleg Nesterov
Cc: Aaron Tomlin
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman" ,
Cc: Sterling Alexander
Cc: Peter Zijlstra
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-12-11 09:41:09 +0800

06 Aug, 2014

1 commit

bb2cbf5e9 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security ... Browse Code »

Pull security subsystem updates from James Morris:
"In this release:

- PKCS#7 parser for the key management subsystem from David Howells
- appoint Kees Cook as seccomp maintainer
- bugfixes and general maintenance across the subsystem"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
X.509: Need to export x509_request_asymmetric_key()
netlabel: shorter names for the NetLabel catmap funcs/structs
netlabel: fix the catmap walking functions
netlabel: fix the horribly broken catmap functions
netlabel: fix a problem when setting bits below the previously lowest bit
PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
tpm: simplify code by using %*phN specifier
tpm: Provide a generic means to override the chip returned timeouts
tpm: missing tpm_chip_put in tpm_get_random()
tpm: Properly clean sysfs entries in error path
tpm: Add missing tpm_do_selftest to ST33 I2C driver
PKCS#7: Use x509_request_asymmetric_key()
Revert "selinux: fix the default socket labeling in sock_graft()"
X.509: x509_request_asymmetric_keys() doesn't need string length arguments
PKCS#7: fix sparse non static symbol warning
KEYS: revert encrypted key change
ima: add support for measuring and appraising firmware
firmware_class: perform new LSM checks
security: introduce kernel_fw_from_file hook
PKCS#7: Missing inclusion of linux/err.h
...

Linus Torvalds
2014-08-06 23:06:39 +0800

24 Jul, 2014

2 commits

7d8b6c637 CAPABILITIES: remove undefined caps from all processes ... Browse Code »

This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
plus fixing it a different way...

We found, when trying to run an application from an application which
had dropped privs that the kernel does security checks on undefined
capability bits. This was ESPECIALLY difficult to debug as those
undefined bits are hidden from /proc/$PID/status.

Consider a root application which drops all capabilities from ALL 4
capability sets. We assume, since the application is going to set
eff/perm/inh from an array that it will clear not only the defined caps
less than CAP_LAST_CAP, but also the higher 28ish bits which are
undefined future capabilities.

The BSET gets cleared differently. Instead it is cleared one bit at a
time. The problem here is that in security/commoncap.c::cap_task_prctl()
we actually check the validity of a capability being read. So any task
which attempts to 'read all things set in bset' followed by 'unset all
things set in bset' will not even attempt to unset the undefined bits
higher than CAP_LAST_CAP.

So the 'parent' will look something like:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffc000000000

All of this 'should' be fine. Given that these are undefined bits that
aren't supposed to have anything to do with permissions. But they do...

So lets now consider a task which cleared the eff/perm/inh completely
and cleared all of the valid caps in the bset (but not the invalid caps
it couldn't read out of the kernel). We know that this is exactly what
the libcap-ng library does and what the go capabilities library does.
They both leave you in that above situation if you try to clear all of
you capapabilities from all 4 sets. If that root task calls execve()
the child task will pick up all caps not blocked by the bset. The bset
however does not block bits higher than CAP_LAST_CAP. So now the child
task has bits in eff which are not in the parent. These are
'meaningless' undefined bits, but still bits which the parent doesn't
have.

The problem is now in cred_cap_issubset() (or any operation which does a
subset test) as the child, while a subset for valid cap bits, is not a
subset for invalid cap bits! So now we set durring commit creds that
the child is not dumpable. Given it is 'more priv' than its parent. It
also means the parent cannot ptrace the child and other stupidity.

The solution here:
1) stop hiding capability bits in status
This makes debugging easier!

2) stop giving any task undefined capability bits. it's simple, it you
don't put those invalid bits in CAP_FULL_SET you won't get them in init
and you won't get them in any other task either.
This fixes the cap_issubset() tests and resulting fallout (which
made the init task in a docker container untraceable among other
things)

3) mask out undefined bits when sys_capset() is called as it might use
~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

4) mask out undefined bit when we read a file capability off of disk as
again likely all bits are set in the xattr for forward/backward
compatibility.
This lets 'setcap all+pe /bin/bash; /bin/bash' run

Signed-off-by: Eric Paris
Reviewed-by: Kees Cook
Cc: Andrew Vagin
Cc: Andrew G. Morgan
Cc: Serge E. Hallyn
Cc: Kees Cook
Cc: Steve Grubb
Cc: Dan Walsh
Cc: stable@vger.kernel.org
Signed-off-by: James Morris

Eric Paris
2014-07-24 19:53:47 +0800
57e0be041 sched: Make task->real_start_time nanoseconds based ... Browse Code »

Simplify the only user of this data by removing the timespec
conversion.

Signed-off-by: Thomas Gleixner
Signed-off-by: John Stultz

Thomas Gleixner
2014-07-24 01:18:05 +0800

08 Apr, 2014

1 commit

ad86622b4 wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space ... Browse Code »

get_task_state() uses the most significant bit to report the state to
user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
can be noticed via /proc as Z -> X -> Z change. Note that this was
possible even before EXIT_TRACE was introduced.

This is not really bad but imho it make sense to hide EXIT_TRACE from
user-space completely. So the patch simply swaps EXIT_ZOMBIE and
EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.

Signed-off-by: Oleg Nesterov
Cc: Jan Kratochvil
Cc: Michal Schmidt
Cc: Al Viro
Cc: Lennart Poettering
Cc: Roland McGrath
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-04-08 07:36:06 +0800

24 Jan, 2014

2 commits

185ee40ee fs/proc/array.c: change do_task_stat() to use while_each_thread() ... Browse Code »

Change the remaining next_thread (ab)users to use while_each_thread().

The last user which should be changed is next_tid(), but we can't do this
now.

__exit_signal() and complete_signal() are fine, they actually need
next_thread() logic.

This patch (of 3):

do_task_stat() can use while_each_thread(), no changes in
the compiled code.

Signed-off-by: Oleg Nesterov
Cc: KOSAKI Motohiro
Cc: Al Viro
Cc: Kees Cook
Reviewed-by: Sameer Nanda
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:02 +0800
74e37200d proc: cleanup/simplify get_task_state/task_state_array ... Browse Code »

get_task_state() and task_state_array[] look confusing and suboptimal, it
is not clear what it can actually report to user-space and
task_state_array[] blows .data for no reason.

1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not
clear. TASK_REPORT is self-documenting but it is not clear
what ->exit_state can add.

Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD)
into TASK_REPORT and use it to calculate the final result.

2. With the change above it is obvious that task_state_array[]
has the unused entries just to make BUILD_BUG_ON() happy.

Change this BUILD_BUG_ON() to use TASK_REPORT rather than
TASK_STATE_MAX and shrink task_state_array[].

3. Turn the "while (state)" loop into fls(state).

Signed-off-by: Oleg Nesterov
Cc: Peter Zijlstra
Cc: David Laight
Cc: Geert Uytterhoeven
Cc: Ingo Molnar
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-01-24 08:37:01 +0800

09 Oct, 2013

1 commit

e29cf08b0 sched/numa: Report a NUMA task group ID ... Browse Code »

It is desirable to model from userspace how the scheduler groups tasks
over time. This patch adds an ID to the numa_group and reports it via
/proc/PID/status.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Johannes Weiner
Cc: Srikar Dronamraju
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1381141781-10992-45-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar

Mel Gorman
2013-10-09 20:47:49 +0800

12 Apr, 2013

1 commit

f2530dc71 kthread: Prevent unpark race which puts threads on the wrong cpu ... Browse Code »

The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.

Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.

The migration thread has another related issue.

CPU0 CPU1
Bring up CPU2
create_thread(T)
park(T)
wait_for_completion()
parkme()
complete()
sched_set_stop_task()
schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().

Reported-by: Dave Jones
Reported-and-tested-by: Dave Hansen
Reported-and-tested-by: Borislav Petkov
Acked-by: Peter Ziljstra
Cc: Srivatsa S. Bhat
Cc: dhillf@gmail.com
Cc: Ingo Molnar
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner

Thomas Gleixner
2013-04-12 20:18:43 +0800

28 Jan, 2013

1 commit

6fac4829c cputime: Use accessors to read task cputime stats ... Browse Code »

This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.

Signed-off-by: Frederic Weisbecker
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Li Zhong
Cc: Namhyung Kim
Cc: Paul E. McKenney
Cc: Paul Gortmaker
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner

Frederic Weisbecker
2013-01-28 02:23:31 +0800

18 Dec, 2012

6 commits

848b81415 Merge branch 'akpm' (Andrew's patch-bomb) ... Browse Code »

Merge misc patches from Andrew Morton:
"Incoming:

- lots of misc stuff

- backlight tree updates

- lib/ updates

- Oleg's percpu-rwsem changes

- checkpatch

- rtc

- aoe

- more checkpoint/restart support

I still have a pile of MM stuff pending - Pekka should be merging
later today after which that is good to go. A number of other things
are twiddling thumbs awaiting maintainer merges."

* emailed patches from Andrew Morton : (180 commits)
scatterlist: don't BUG when we can trivially return a proper error.
docs: update documentation about /proc//fdinfo/ fanotify output
fs, fanotify: add @mflags field to fanotify output
docs: add documentation about /proc//fdinfo/ output
fs, notify: add procfs fdinfo helper
fs, exportfs: add exportfs_encode_inode_fh() helper
fs, exportfs: escape nil dereference if no s_export_op present
fs, epoll: add procfs fdinfo helper
fs, eventfd: add procfs fdinfo helper
procfs: add ability to plug in auxiliary fdinfo providers
tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
breakpoint selftests: print failure status instead of cause make error
kcmp selftests: print fail status instead of cause make error
kcmp selftests: make run_tests fix
mem-hotplug selftests: print failure status instead of cause make error
cpu-hotplug selftests: print failure status instead of cause make error
mqueue selftests: print failure status instead of cause make error
vm selftests: print failure status instead of cause make error
ubifs: use prandom_bytes
mtd: nandsim: use prandom_bytes
...

Linus Torvalds
2012-12-18 12:58:12 +0800
138d22b58 fs, epoll: add procfs fdinfo helper ... Browse Code »

This allows us to print out eventpoll target file descriptor, events and
data, the /proc/pid/fdinfo/fd consists of

| pos: 0
| flags: 02
| tfd: 5 events: 1d data: ffffffffffffffff enabled: 1

[avagin@: fix for unitialized ret variable]

Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-12-18 09:15:27 +0800
8d238027b proc: pid/status: show all supplementary groups ... Browse Code »

We display a list of supplementary group for each process in
/proc//status. However, we show only the first 32 groups, not all of
them.

Although this is rare, but sometimes processes do have more than 32
supplementary groups, and this kernel limitation breaks user-space apps
that rely on the group list in /proc//status.

Number 32 comes from the internal NGROUPS_SMALL macro which defines the
length for the internal kernel "small" groups buffer. There is no
apparent reason to limit to this value.

This patch removes the 32 groups printing limit.

The Linux kernel limits the amount of supplementary groups by NGROUPS_MAX,
which is currently set to 65536. And this is the maximum count of groups
we may possibly print.

Signed-off-by: Artem Bityutskiy
Acked-by: Serge E. Hallyn
Acked-by: Kees Cook
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Artem Bityutskiy
2012-12-18 09:15:23 +0800
2f4b3bf6b /proc/pid/status: add "Seccomp" field ... Browse Code »

It is currently impossible to examine the state of seccomp for a given
process. While attaching with gdb and attempting "call
prctl(PR_GET_SECCOMP,...)" will work with some situations, it is not
reliable. If the process is in seccomp mode 1, this query will kill the
process (prctl not allowed), if the process is in mode 2 with prctl not
allowed, it will similarly be killed, and in weird cases, if prctl is
filtered to return errno 0, it can look like seccomp is disabled.

When reviewing the state of running processes, there should be a way to
externally examine the seccomp mode. ("Did this build of Chrome end up
using seccomp?" "Did my distro ship ssh with seccomp enabled?")

This adds the "Seccomp" line to /proc/$pid/status.

Signed-off-by: Kees Cook
Reviewed-by: Cyrill Gorcunov
Cc: Andrea Arcangeli
Cc: James Morris
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2012-12-18 09:15:22 +0800
7b9a7ec56 proc: don't show nonexistent capabilities ... Browse Code »

Without this patch it is really hard to interpret a bounding set, if
CAP_LAST_CAP is unknown for a current kernel.

Non-existant capabilities can not be deleted from a bounding set with help
of prctl.

E.g.: Here are two examples without/with this patch.

CapBnd: ffffffe0fdecffff
CapBnd: 00000000fdecffff

I suggest to hide non-existent capabilities. Here is two reasons.
* It's logically and easier for using.
* It helps to checkpoint-restore capabilities of tasks, because tasks
can be restored on another kernel, where CAP_LAST_CAP is bigger.

Signed-off-by: Andrew Vagin
Cc: Andrew G. Morgan
Reviewed-by: Serge E. Hallyn
Cc: Pavel Emelyanov
Reviewed-by: Kees Cook
Cc: KAMEZAWA Hiroyuki
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Vagin
2012-12-18 09:15:22 +0800
6a2b60b17 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace changes from Eric Biederman:
"While small this set of changes is very significant with respect to
containers in general and user namespaces in particular. The user
space interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.
The tyranny of supporting suid root preventing unprivileged users from
using cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications. Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out
tendrils into all kinds of unexpected cleanup paths for operation. At
least one case of broken error handling is fixed by this cleanup.

The files under /proc//ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS,
ensuring the files always refer to the namespace the process is
currently using and ensuring that the ptrace_mayaccess permission
checks are always applied.

The files under /proc//ns/ have been given stable inode numbers
so it is now possible to see if different processes share the same
namespaces.

Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user
namespace root to usefully use the networking stack. Similar changes
for the mount namespace and the pid namespace are coming through my
tree.

Two small changes to add user namespace support were commited here adn
in David Miller's -net tree so that I could complete the work on the
/proc//ns/ files in this tree.

Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
Kconfig guard remains in place preventing that user namespaces from
being built when any of those filesystems are enabled.

Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
proc: Usable inode numbers for the namespace file descriptors.
proc: Fix the namespace inode permission checks.
proc: Generalize proc inode allocation
userns: Allow unprivilged mounts of proc and sysfs
userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
procfs: Print task uids and gids in the userns that opened the proc file
userns: Implement unshare of the user namespace
userns: Implent proc namespace operations
userns: Kill task_user_ns
userns: Make create_new_namespaces take a user_ns parameter
userns: Allow unprivileged use of setns.
userns: Allow unprivileged users to create new namespaces
userns: Allow setting a userns mapping to your current uid.
userns: Allow chown and setgid preservation
userns: Allow unprivileged users to create user namespaces.
userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
userns: fix return value on mntns_install() failure
vfs: Allow unprivileged manipulation of the mount namespace.
vfs: Only support slave subtrees across different user namespaces
vfs: Add a user namespace reference from struct mnt_namespace
...

Linus Torvalds
2012-12-18 07:44:47 +0800

29 Nov, 2012

1 commit

e80d0a1ae cputime: Rename thread_group_times to thread_group_cputime_adjusted ... Browse Code »

We have thread_group_cputime() and thread_group_times(). The naming
doesn't provide enough information about the difference between
these two APIs.

To lower the confusion, rename thread_group_times() to
thread_group_cputime_adjusted(). This name better suggests that
it's a version of thread_group_cputime() that does some stabilization
on the raw cputime values. ie here: scale on top of CFS runtime
stats and bound lower value for monotonicity.

Signed-off-by: Frederic Weisbecker
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Steven Rostedt
Cc: Paul Gortmaker

Frederic Weisbecker
2012-11-29 00:07:57 +0800

20 Nov, 2012

1 commit

e9f238c30 procfs: Print task uids and gids in the userns that opened the proc file ... Browse Code »

Instead of using current_userns() use the userns of the opener
of the file so that if the file is passed between processes
the contents of the file do not change.

Acked-by: Serge E. Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2012-11-20 20:18:15 +0800

01 Jun, 2012

3 commits

5b172087f c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat ... Browse Code »

We would like to have an ability to restore command line arguments and
program environment pointers but first we need to obtain them somehow.
Thus we put these values into /proc/$pid/stat. The exit_code is needed to
restore zombie tasks.

Signed-off-by: Cyrill Gorcunov
Acked-by: Kees Cook
Cc: Pavel Emelyanov
Cc: Serge Hallyn
Cc: KAMEZAWA Hiroyuki
Cc: Alexey Dobriyan
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Vasiliy Kulikov
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
818411616 fs, proc: introduce /proc/<pid>/task/<tid>/children entry ... Browse Code »

When we do checkpoint of a task we need to know the list of children the
task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary (while a parent pid is
provided in "PPid" field of /proc//status).

So instead of walking over all pids in the system (creating one big
process tree in memory, just to figure out which children a task has) --
we add explicit /proc//task//children entry, because the kernel
already has this kind of information but it is not yet exported.

This is a first level children, not the whole process tree.

Signed-off-by: Cyrill Gorcunov
Reviewed-by: Oleg Nesterov
Reviewed-by: Kees Cook
Cc: Pavel Emelyanov
Cc: Serge Hallyn
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-06-01 08:49:32 +0800
715be1fce procfs: use more apprioriate types when dumping /proc/N/stat ... Browse Code »

- use int fpr priority and nice, since task_nice()/task_prio() return that

- field 24: get_mm_rss() returns unsigned long

Signed-off-by: Jan Engelhardt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Engelhardt
2012-06-01 08:49:29 +0800

16 May, 2012

1 commit

dcb0f2228 userns: Convert proc to use kuid/kgid where appropriate ... Browse Code »

Acked-by: Serge Hallyn
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2012-05-16 05:59:28 +0800