11 Jan, 2012

1 commit

  • Add support for mount options to restrict access to /proc/PID/
    directories. The default backward-compatible "relaxed" behaviour is left
    untouched.

    The first mount option is called "hidepid" and its value defines how much
    info about processes we want to be available for non-owners:

    hidepid=0 (default) means the old behavior - anybody may read all
    world-readable /proc/PID/* files.

    hidepid=1 means users may not access any /proc/PID/ directories but
    their own. Sensitive files like cmdline, sched*, status are now protected
    against other users. As permission checking is done in proc_pid_permission()
    and files' permissions are left untouched, programs expecting specific
    files' modes are not confused.

    hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
    users. It doesn't mean that it hides whether a process exists (it can be
    learned by other means, e.g. by kill -0 $PID), but it hides a process's euid
    and egid. It complicates an intruder's task of gathering info about running
    processes, whether some daemon runs with elevated privileges, whether
    another user runs some sensitive program, whether other users run any
    program at all, etc.

    gid=XXX defines a group that will be able to gather all processes' info
    (as in hidepid=0 mode). This group should be used instead of putting
    nonroot user in sudoers file or something. However, untrusted users (like
    daemons, etc.) which are not supposed to monitor the tasks in the whole
    system should not be added to the group.
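
    The access rules described above can be modelled roughly as follows. This
    is a userspace sketch only: the struct and function names are hypothetical,
    each viewer is given a single group for brevity, and plain uid equality
    stands in for whatever ownership check the kernel really performs in
    proc_pid_permission().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two mount options; not kernel code. */
struct proc_mount_opts {
    int hidepid;       /* 0, 1 or 2, as in hidepid=N */
    unsigned int gid;  /* group exempt from the restrictions, as in gid=XXX */
};

/* Simplified viewer: one uid and one group (assumption of this sketch). */
struct viewer {
    unsigned int uid;
    unsigned int gid;
};

/* May the viewer look inside /proc/PID/ of a task owned by owner_uid? */
bool may_access_pid_dir(const struct proc_mount_opts *o,
                        const struct viewer *v, unsigned int owner_uid)
{
    assert(o->hidepid >= 0 && o->hidepid <= 2);
    if (o->hidepid == 0)
        return true;             /* old behaviour: file modes decide */
    if (v->uid == 0 || v->gid == o->gid)
        return true;             /* root and the monitoring group are exempt */
    return v->uid == owner_uid;  /* otherwise only the owner gets in */
}

/* Does /proc/PID/ appear in a listing of /proc for the viewer? */
bool pid_dir_visible(const struct proc_mount_opts *o,
                     const struct viewer *v, unsigned int owner_uid)
{
    if (o->hidepid < 2)
        return true;             /* hidepid=1 blocks access but still lists */
    return may_access_pid_dir(o, v, owner_uid);
}
```

    With hidepid=1 an unprivileged viewer still sees root's /proc/PID/ entry
    but cannot enter it; with hidepid=2 the entry disappears from the listing
    as well, unless the viewer belongs to the gid= group.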

    hidepid=1 or higher is designed to restrict access to procfs files, which
    might reveal some sensitive private information like precise keystrokes
    timings:

    http://www.openwall.com/lists/oss-security/2011/11/05/3

    hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
    conky gracefully handle EPERM/ENOENT and behave as if the current user is
    the only user running processes. pstree shows the process subtree which
    contains "pstree" process.

    Note: the patch doesn't deal with setuid/setgid issues of keeping
    preopened descriptors of procfs files (like
    https://lkml.org/lkml/2011/2/7/368). We rely on the fact that the leaked
    information, like the scheduling counters of setuid apps, doesn't threaten
    anybody's privacy: only the user who started the setuid program may read
    the counters.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

09 Jan, 2009

1 commit

  • Currently task_active_pid_ns is not safe to call after a task becomes a
    zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By
    reading the pid namespace from the pid of the task we can trivially solve
    this problem at the cost of one extra memory read in what should be the
    same cacheline as we read the namespace from.

    When moving things around I have made task_active_pid_ns out of line
    because keeping it in pid_namespace.h would require adding includes of
    pid.h and sched.h that I don't think we want.

    This change does make task_active_pid_ns unsafe to call during
    copy_process until we attach a pid on the task_struct which seems to be a
    reasonable trade off.
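
    A minimal userspace model of the idea, assuming simplified structs: the
    field layout, MAX_LEVELS and struct task are illustrative stand-ins, not
    the kernel definitions. The namespace is read from the task's pid, so a
    NULL nsproxy no longer matters, while a task without an attached pid
    yields NULL.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 4  /* arbitrary bound for this model */

struct pid_namespace { unsigned int level; };
struct upid { int nr; struct pid_namespace *ns; };

struct pid {
    unsigned int level;              /* deepest namespace the pid lives in */
    struct upid numbers[MAX_LEVELS]; /* one entry per level, 0..level */
};

struct task {
    struct pid *pid;  /* NULL until a pid is attached to the task */
    void *nsproxy;    /* may be NULL for a zombie; never consulted here */
};

/* The namespace a pid was allocated in is stored in its deepest entry. */
struct pid_namespace *ns_of_pid(const struct pid *pid)
{
    if (pid == NULL)
        return NULL;
    assert(pid->level < MAX_LEVELS);
    return pid->numbers[pid->level].ns;
}

/* Active namespace of a task, found without touching tsk->nsproxy. */
struct pid_namespace *task_active_pid_ns(const struct task *tsk)
{
    return ns_of_pid(tsk->pid);
}
```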

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Bastian Blank
    Cc: Pavel Emelyanov
    Cc: Nadia Derbey
    Acked-by: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

17 Oct, 2008

1 commit


26 Jul, 2008

2 commits


30 Apr, 2008

1 commit

  • These values represent the nesting level of a namespace and pids living in it,
    and it's always non-negative.

    Turning this from int to unsigned int saves some space in pid.c (11 bytes on
    x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit.
    E.g. on ia64 this removes the sign extension calls, which the compiler adds to
    optimize access to pid->numbers[ns->level].
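
    For illustration, a userspace sketch of a pid_nr_ns-style lookup with an
    unsigned level. The structs are simplified stand-ins and MAX_LEVELS is an
    assumption of this model; the point is only that the array index is
    unsigned.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 4  /* arbitrary bound for this model */

struct pid_namespace { unsigned int level; };
struct upid { int nr; struct pid_namespace *ns; };
struct pid { unsigned int level; struct upid numbers[MAX_LEVELS]; };

/* Return the id of pid as seen from ns, or 0 if it is not visible there. */
int pid_nr_ns(const struct pid *pid, const struct pid_namespace *ns)
{
    if (pid == NULL || ns->level > pid->level)
        return 0;
    assert(ns->level < MAX_LEVELS);
    /* ns->level is unsigned, so no sign extension is needed for the index */
    const struct upid *upid = &pid->numbers[ns->level];
    return upid->ns == ns ? upid->nr : 0;
}
```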

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

09 Feb, 2008

1 commit

  • Just like with the user namespaces, move the namespace management code into
    the separate .c file and mark the (already existing) PID_NS option as "depend
    on NAMESPACES"

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Pavel Emelyanov
    Acked-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

15 Nov, 2007

1 commit

  • This is my trivial patch to swat innumerable little bugs with a single
    blow.

    After some intensive review (my apologies for not having gotten to this
    sooner) what we have looks like a good base to build on with the current
    pid namespace code, but it is not complete, and it is still much too simple
    to find issues where the kernel does the wrong thing outside of the initial
    pid namespace.

    Until the dust settles and we are certain we have the ABI and the
    implementation is as correct as humanly possible let's keep process ID
    namespaces behind CONFIG_EXPERIMENTAL.

    Allowing us the option of fixing any ABI or other bugs we find as long as
    they are minor.

    Allowing users of the kernel to avoid those bugs simply by ensuring their
    kernel does not have support for multiple pid namespaces.

    [akpm@linux-foundation.org: coding-style cleanups]
    Signed-off-by: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Adrian Bunk
    Cc: Jeremy Fitzhardinge
    Cc: Kir Kolyshkin
    Cc: Kirill Korotaev
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

20 Oct, 2007

7 commits

  • * remove pid.h from pid_namespace.h;
    * rework is_(cgroup|global)_init;
    * optimize (get|put)_pid_ns for init_pid_ns;
    * declare task_child_reaper to return actual reaper.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Each pid namespace has to be visible through its own proc mount. Thus we
    need to have per-namespace proc trees with their own superblocks.

    We cannot easily show different pid namespaces via one global proc tree, since
    each pid refers to different tasks in different namespaces. E.g. pid 1
    refers to the init task in the initial namespace and to some other task when
    seen from another namespace. Moreover, a pid existing in one namespace may
    not exist in the other.

    This approach has one more advantage: tasks from the init namespace
    can see what tasks live in another namespace by reading entries from another
    proc tree.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Each namespace has a parent and is characterized by its "level". The level is
    the generation number of the namespace. E.g. the init namespace has level 0, a
    namespace cloned from it has level 1, the next one has level 2, and so on.
    This level is not explicitly limited.

    A true hierarchy must have some way to find each namespace's children, but it is
    not used in the patches, so this ability is not added (yet).
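
    A small sketch of the parent/level relationship, assuming a hypothetical
    create_child_ns() helper and heap allocation via malloc(). As in the
    description, only the parent link is kept; there is no list of children.

```c
#include <assert.h>
#include <stdlib.h>

struct pid_namespace {
    unsigned int level;           /* 0 for the init namespace */
    struct pid_namespace *parent; /* NULL for the init namespace */
};

/* Hypothetical helper: create a namespace one level below its parent. */
struct pid_namespace *create_child_ns(struct pid_namespace *parent)
{
    assert(parent != NULL);
    struct pid_namespace *ns = malloc(sizeof(*ns));
    if (ns == NULL)
        return NULL;
    ns->parent = parent;
    ns->level = parent->level + 1;
    return ns;
}
```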

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Rename the child_reaper() function to task_child_reaper() to be similar to
    other task_* functions and to distinguish the function from 'struct
    pid_namespace.child_reaper'.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • With multiple pid namespaces, a process is known by some pid_t in every
    ancestor pid namespace. Every time the process forks, the child process also
    gets a pid_t in every ancestor pid namespace.

    While a process is visible in >=1 pid namespaces, it can see pid_t's in only
    one pid namespace. We call this pid namespace its "active pid namespace",
    and it is always the youngest pid namespace in which the process is known.

    This patch defines and uses a wrapper to find the active pid namespace of a
    process. The implementation of the wrapper will be changed when support
    for multiple pid namespaces is added.
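
    To make the "one pid_t per ancestor namespace" idea concrete, here is a
    toy model. The counter-based allocator, MAX_LEVELS and both function names
    are assumptions of this sketch, not the kernel's allocation code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 4  /* arbitrary bound for this model */

struct pid_namespace {
    unsigned int level;
    struct pid_namespace *parent;
    int last_nr;  /* toy allocator: last id handed out in this namespace */
};
struct upid { int nr; struct pid_namespace *ns; };
struct pid { unsigned int level; struct upid numbers[MAX_LEVELS]; };

/* Give pid one fresh id in ns and in every ancestor of ns. */
void assign_pid_numbers(struct pid *pid, struct pid_namespace *ns)
{
    assert(ns->level < MAX_LEVELS);
    pid->level = ns->level;
    for (struct pid_namespace *cur = ns; cur != NULL; cur = cur->parent) {
        pid->numbers[cur->level].nr = ++cur->last_nr;
        pid->numbers[cur->level].ns = cur;
    }
}

/* The active namespace is the youngest one the pid is known in. */
struct pid_namespace *active_pid_ns(const struct pid *pid)
{
    return pid->numbers[pid->level].ns;
}
```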

    Changelog:
    2.6.22-rc4-mm2-pidns1:
    - [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
    task_active_pid_ns() in child_reaper() since task->nsproxy
    can be NULL during task exit (so child_reaper() continues to
    use init_pid_ns).

    2.6.21-rc6-mm1:
    - Rename task_pid_ns() to task_active_pid_ns() to reflect that a
    process can have multiple pid namespaces.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Add kmem_cache to pid_namespace to allocate pids from.

    Since both implementations expand the struct pid to carry more numerical
    values each namespace should have separate cache to store pids of different
    sizes.

    Each kmem cache is named "pid_N", where N is the number of numerical ids
    on the pid. Different namespaces with the same level of nesting will share
    the same cache.

    This patch has two FIXMEs that are to be fixed after we reach the consensus
    about the struct pid itself.

    The first one is that the namespace to free the pid from in free_pid() must be
    taken from pid. Now the init_pid_ns is used.

    The second FIXME is about the cache allocation. When we do know how long the
    object will be then we'll have to calculate this size in create_pid_cachep.
    Right now the sizeof(struct pid) value is used.
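
    The cache-sharing scheme can be sketched in userspace like this. The
    fixed-size table and the find_or_create_pid_cache() name are hypothetical,
    and no real slab allocation happens; the sketch only shows one named cache
    per id count, reused by every namespace with the same nesting level.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CACHES 8  /* arbitrary bound for this model */

struct pid_cache {
    int nr_ids;     /* how many numerical ids a pid from this cache carries */
    char name[16];  /* "pid_" followed by nr_ids */
};

static struct pid_cache caches[MAX_CACHES];
static int nr_caches;

/* Return the cache for pids with nr_ids numbers, creating it on first use. */
struct pid_cache *find_or_create_pid_cache(int nr_ids)
{
    assert(nr_ids >= 1);
    for (int i = 0; i < nr_caches; i++)
        if (caches[i].nr_ids == nr_ids)
            return &caches[i];  /* same nesting level: share the cache */
    if (nr_caches == MAX_CACHES)
        return NULL;
    struct pid_cache *c = &caches[nr_caches++];
    c->nr_ids = nr_ids;
    snprintf(c->name, sizeof(c->name), "pid_%d", nr_ids);
    return c;
}
```

    Under the assumption that a namespace at level L needs L + 1 ids per pid,
    two sibling namespaces at level 1 would both end up using "pid_2".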

    [akpm@linux-foundation.org: coding-style repair]
    Signed-off-by: Pavel Emelianov
    Acked-by: Cedric Le Goater
    Acked-by: Sukadev Bhattiprolu
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     
  • Make get_pid_ns() return the namespace itself to look like the other getters
    and make the code using it look nicer.
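
    A sketch of why a getter that returns its argument reads better at call
    sites. The plain int refcount, struct task and share_pid_ns() are
    assumptions of this model.

```c
#include <assert.h>

struct pid_namespace { int refcount; };
struct task { struct pid_namespace *pid_ns; };

/* Take a reference and hand the namespace back to the caller. */
struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
{
    assert(ns->refcount > 0);
    ns->refcount++;
    return ns;
}

/* Hypothetical caller: take the reference and assign in one expression. */
void share_pid_ns(struct task *child, const struct task *parent)
{
    child->pid_ns = get_pid_ns(parent->pid_ns);
}
```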

    Signed-off-by: Pavel Emelianov
    Acked-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     

17 Jul, 2007

1 commit

  • While working on unshare support for the network namespace I noticed we
    were putting clone flags in an int. Which is weird because the syscall
    uses unsigned long and we at least need an unsigned to properly hold all of
    the unshare flags.

    So to make the code consistent, this patch updates the code to use
    unsigned long instead of int for the clone flags in those places
    where we get it wrong today.
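
    The range problem can be shown with two example flag values. The macros
    below are hypothetical bit patterns chosen for illustration, and the
    sketch assumes a 32-bit int, which is typical but not guaranteed by C.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical flag values: bit 30 and bit 31 of the flags word. */
#define EXAMPLE_FLAG_BIT30 0x40000000UL
#define EXAMPLE_FLAG_BIT31 0x80000000UL

/* Can the flags be stored in a signed int without changing value? */
bool fits_in_int(unsigned long flags)
{
    return flags <= (unsigned long)INT_MAX;
}

/* Can the flags be stored in an unsigned int without changing value? */
bool fits_in_unsigned(unsigned long flags)
{
    return flags <= UINT_MAX;
}

/* Test a flag on the full-width type the syscall interface uses. */
bool flag_is_set(unsigned long flags, unsigned long flag)
{
    assert(flag != 0);
    return (flags & flag) != 0;
}
```

    A flag in bit 31 does not fit in a 32-bit signed int but does fit in an
    unsigned int, and unsigned long matches the syscall argument type.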

    Signed-off-by: Eric W. Biederman
    Acked-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

09 May, 2007

1 commit

  • sys_clone() and sys_unshare() both make copies of nsproxy and its associated
    namespaces. But they have different code paths.

    This patch merges all the nsproxy and its associated namespace copy/clone
    handling (as much as possible). Posted on container list earlier for
    feedback.

    - Create a new nsproxy and its associated namespaces and pass it back to
    caller to attach it to right process.

    - Changed all copy_*_ns() routines to return a new copy of namespace
    instead of attaching it to task->nsproxy.

    - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.

    - Removed unnecessary !ns checks from copy_*_ns() and added BUG_ON()
    just in case.

    - Get rid of all individual unshare_*_ns() routines and make use of
    copy_*_ns() instead.
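
    The shape of the merged path can be sketched as follows. This is a
    userspace model under stated assumptions: the flag values, struct ns, the
    single copy_ns() helper standing in for the copy_*_ns() routines, and
    put_ns() are all illustrative, and an assert() plays the role of BUG_ON().

```c
#include <assert.h>
#include <stdlib.h>

#define FLAG_NEW_UTS 0x1UL  /* hypothetical flag values for this model */
#define FLAG_NEW_IPC 0x2UL

struct ns { int refcount; };
struct nsproxy { struct ns *uts_ns; struct ns *ipc_ns; };

/* Drop one reference; free heap-allocated namespaces at zero. */
void put_ns(struct ns *n)
{
    if (n != NULL && --n->refcount == 0)
        free(n);
}

/* Return a fresh namespace if flag is requested, else a new ref to old. */
struct ns *copy_ns(unsigned long flags, unsigned long flag, struct ns *old)
{
    assert(old != NULL);  /* stands in for the BUG_ON() mentioned above */
    if (!(flags & flag)) {
        old->refcount++;
        return old;
    }
    struct ns *fresh = malloc(sizeof(*fresh));
    if (fresh != NULL)
        fresh->refcount = 1;
    return fresh;
}

/* Build a new nsproxy and return it; the caller attaches it to a task. */
struct nsproxy *create_new_namespaces(unsigned long flags,
                                      const struct nsproxy *cur)
{
    struct nsproxy *np = malloc(sizeof(*np));
    if (np == NULL)
        return NULL;
    np->uts_ns = copy_ns(flags, FLAG_NEW_UTS, cur->uts_ns);
    np->ipc_ns = copy_ns(flags, FLAG_NEW_IPC, cur->ipc_ns);
    if (np->uts_ns == NULL || np->ipc_ns == NULL) {
        put_ns(np->uts_ns);
        put_ns(np->ipc_ns);
        free(np);
        return NULL;
    }
    return np;
}
```

    Because the copy routines only return a namespace and never touch the
    task, the same function can serve both a clone-style and an unshare-style
    caller.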

    [akpm@osdl.org: cleanups, warning fix]
    [clg@fr.ibm.com: remove dup_namespaces() declaration]
    [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
    [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Cedric Le Goater
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     

31 Jan, 2007

1 commit

  • This is based on a patch by Eric W. Biederman, who pointed out that pid
    namespaces are still fake, and we only have one ever active.

    So for the time being, we can modify any code which could access
    tsk->nsproxy->pid_ns during task exit to just use &init_pid_ns instead,
    and move the exit_task_namespaces call in do_exit() back above
    exit_notify(), so that an exiting nfs server has a valid tsk->sighand to
    work with.

    Long term, pulling pid_ns out of nsproxy might be the cleanest solution.

    Signed-off-by: Eric W. Biederman

    [ Eric's patch fixed to take care of free_pid() too ]

    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

09 Dec, 2006

3 commits

  • Add a per pid_namespace child-reaper. This is needed so processes are reaped
    within the same pid space and do not spill over to the parent pid space. It's
    also needed so containers preserve the existing semantics that pid == 1 reaps
    orphaned children.
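
    A toy model of per-namespace reaping, assuming simplified structs and a
    hypothetical reparent_orphan() helper: an orphan is handed to the reaper
    of its own pid namespace rather than to a global init.

```c
#include <assert.h>
#include <stddef.h>

struct task;

struct pid_namespace {
    struct task *child_reaper;  /* the "pid 1" of this namespace */
};

struct task {
    struct pid_namespace *pid_ns;
    struct task *parent;
};

/* Hypothetical helper: hand an orphan to the reaper of its own namespace. */
void reparent_orphan(struct task *child)
{
    assert(child->pid_ns != NULL && child->pid_ns->child_reaper != NULL);
    child->parent = child->pid_ns->child_reaper;
}
```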

    This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Add the pid namespace framework to the nsproxy object. The copy of the pid
    namespace only increases the refcount on the global pid namespace,
    init_pid_ns, and unshare is not implemented.

    There is no configuration option to activate or deactivate this feature
    because this is not relevant for the moment.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Rename struct pspace to struct pid_namespace for consistency with other
    namespaces (uts_namespace and ipc_namespace). Also rename
    include/linux/pspace.h to include/linux/pid_namespace.h and variables from
    pspace to pid_ns.

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu