02 May, 2013

2 commits

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Split the proc namespace stuff out into linux/proc_ns.h.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: Serge E. Hallyn
    cc: Eric W. Biederman
    Signed-off-by: Al Viro

    David Howells
     

27 Mar, 2013

1 commit

  • Only allow unprivileged mounts of proc and sysfs if they are already
    mounted when the user namespace is created.

    proc and sysfs are interesting because they have content that is
    per namespace, and so fresh mounts are needed when new namespaces
    are created while at the same time proc and sysfs have content that
    is shared between every instance.

    Respect the policy of who may see the shared content of proc and sysfs
    by only allowing new mounts if there was an existing mount at the time
    the user namespace was created.

    In practice there are only two interesting cases: proc and sysfs are
    mounted at their usual places, proc and sysfs are not mounted at all
    (some form of mount namespace jail).

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
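
    As an illustration of the two-cursor versus one-cursor forms, here is a
    minimal userspace sketch of an hlist iterator that needs only the entry
    cursor. It is not the code from linux/list.h: the hlist types are
    simplified (no pprev pointer), struct item and sum_items() are
    hypothetical names, and it relies on GNU C extensions (__typeof__ and
    statement expressions).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model of the hlist types (assumption: no pprev). */
struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* NULL-safe conversion from a node pointer to its containing entry. */
#define hlist_entry_safe(ptr, type, member) \
	({ struct hlist_node *__p = (ptr); \
	   __p ? container_of(__p, type, member) : NULL; })

/* Same shape as list_for_each_entry(pos, head, member): one cursor only. */
#define hlist_for_each_entry(pos, head, member) \
	for (pos = hlist_entry_safe((head)->first, __typeof__(*(pos)), member); \
	     pos; \
	     pos = hlist_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

/* Hypothetical entry type used only for this example. */
struct item {
	int value;
	struct hlist_node link;
};

/* Simplified insertion at the head of the list. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	n->next = h->first;
	h->first = n;
}

/* Walks the list with the entry cursor alone; no separate node cursor. */
static int sum_items(struct hlist_head *head)
{
	struct item *it;
	int sum = 0;

	assert(head != NULL);
	hlist_for_each_entry(it, head, link)
		sum += it->value;
	return sum;
}
```

    The cursor is recomputed from the embedded node on every step, which is
    why a separate node variable is not required.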

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

27 Jan, 2013

1 commit

  • When freeing a deeply nested user namespace free_user_ns calls
    put_user_ns on its parent which may in turn call free_user_ns again.
    When -fno-optimize-sibling-calls is passed to gcc one stack frame per
    user namespace is left on the stack, potentially overflowing the
    kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
    so we can't count on gcc to optimize this code.

    Remove struct kref and use a plain atomic_t, making the code more
    flexible and easier to comprehend. Make the loop in free_user_ns
    explicit to guarantee that the stack does not overflow with
    CONFIG_FRAME_POINTER enabled.
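
    The idea of replacing the recursive put/free pair with an explicit loop
    can be sketched in userspace as follows. This is a model under stated
    assumptions, not the kernel code: struct ns, ns_create(), ns_put() and
    the ns_freed counter are hypothetical names, and C11 atomics stand in
    for atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical model of a namespace that pins its parent. */
struct ns {
	atomic_int count;
	struct ns *parent;
};

/* Number of objects released so far; only here to make the demo checkable. */
static long ns_freed;

/* Creates a namespace holding one reference on its parent (if any). */
static struct ns *ns_create(struct ns *parent)
{
	struct ns *ns = malloc(sizeof(*ns));

	if (!ns)
		return NULL;
	assert(!parent || atomic_load(&parent->count) > 0);
	atomic_init(&ns->count, 1);
	ns->parent = parent;
	if (parent)
		atomic_fetch_add(&parent->count, 1);
	return ns;
}

/*
 * Drops one reference. When the count reaches zero the object is freed and
 * the reference it held on its parent is dropped in the same loop, so the
 * stack depth stays constant however deep the nesting is.
 */
static void ns_put(struct ns *ns)
{
	while (ns && atomic_fetch_sub(&ns->count, 1) == 1) {
		struct ns *parent = ns->parent;

		free(ns);
		ns_freed++;
		ns = parent;
	}
}
```

    A recursive version would call ns_put(parent) from inside the free path
    and consume one stack frame per nesting level whenever the compiler does
    not turn the call into a tail call.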

    I have tested this fix with a simple program that uses unshare to
    create a deeply nested user namespace structure and then calls exit.
    With 1000 nested user namespaces before this change, running my test
    program causes the kernel to die a horrible death. With 10,000,000
    nested user namespaces after this change my test program runs to
    completion and causes no harm.

    Acked-by: Serge Hallyn
    Pointed-out-by: Vasily Kulikov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 Nov, 2012

1 commit

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowed by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

18 Sep, 2012

1 commit

  • Implement kprojid_t, a cousin of the kuid_t and kgid_t.

    The per user namespace mapping of project id values can be set with
    /proc/<pid>/projid_map.

    A full complement of helpers is provided: make_kprojid, from_kprojid,
    from_kprojid_munged, kprojid_has_mapping, projid_valid, projid_eq,
    projid_lt.

    Project identifiers are part of the generic disk quota interface,
    although it appears only xfs implements project identifiers currently.

    The xfs code allows anyone who has permission to set the project
    identifier on a file to use any project identifier so when
    setting up the user namespace project identifier mappings I do
    not require a capability.

    Cc: Dave Chinner
    Cc: Jan Kara
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 May, 2012

1 commit

  • On 32bit builds gcc says:
    kernel/user.c:30:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]
    kernel/user.c:38:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default]

    Silence gcc by changing the constant 4294967295 to 4294967295U.
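
    For context: in C90 an unsuffixed decimal constant that does not fit in
    long becomes unsigned long, while C99 makes it a signed long long, which
    is what the warning is about when long is 32 bits wide. A small sketch of
    the suffixed form follows; FULL_ID_RANGE and full_id_range() are
    hypothetical names, and the sketch assumes the constant is used as a
    32-bit id count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the constant; the U suffix makes it unsigned. */
#define FULL_ID_RANGE 4294967295U

/* Returns the constant as a 32-bit value without any implicit sign change. */
static uint32_t full_id_range(void)
{
	assert(FULL_ID_RANGE == UINT32_MAX);
	return FULL_ID_RANGE;
}
```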

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

26 Apr, 2012

2 commits

  • - Convert the old uid mapping functions into compatibility wrappers
    - Add a uid/gid mapping layer from user space uid and gids to kernel
    internal uids and gids that is extent based for simplicity and speed.
    * Working with number space after mapping uids/gids into their kernel
    internal version adds only mapping complexity over what we have today,
    leaving the kernel code easy to understand and test.
    - Add proc files /proc/self/uid_map /proc/self/gid_map
    These files display the mapping and allow a mapping to be added
    if a mapping does not exist.
    - Allow entering the user namespace without a uid or gid mapping.
    Since we are starting with an existing user our uids and gids
    still have global mappings, so they are still valid and useful; they just
    don't have local mappings. The requirement for things to work is a global
    uid and gid, so it is odd but perfectly fine not to have a local uid
    and gid mapping.
    Not requiring global uid and gid mappings greatly simplifies
    the logic of setting up the uid and gid mappings by allowing
    the mappings to be set after the namespace is created which makes the
    slight weirdness worth it.
    - Make the mappings in the initial user namespace to the global
    uid/gid space explicit. Today it is an identity mapping
    but in the future we may want to twist this for debugging, similar
    to what we do with jiffies.
    - Document the memory ordering requirements of setting the uid and
    gid mappings. We only allow the mappings to be set once
    and there are no pointers involved so the requirements are
    trivial but a little atypical.
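
    To make the extent-based mapping concrete, here is a minimal userspace
    sketch of translating ids through a small table of extents. It is an
    illustration under assumptions rather than the kernel implementation:
    struct id_extent, map_id_down(), map_id_up() and INVALID_ID are
    hypothetical names, and each extent maps `count` consecutive ids starting
    at `first` in the namespace to ids starting at `lower_first` outside it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "no mapping exists". */
#define INVALID_ID ((uint32_t)-1)

/* One contiguous range of ids and where it lands in the lower id space. */
struct id_extent {
	uint32_t first;       /* first id inside the namespace */
	uint32_t lower_first; /* corresponding id outside the namespace */
	uint32_t count;       /* number of consecutive ids covered */
};

/* Translates a namespace-local id to the lower id space. */
static uint32_t map_id_down(const struct id_extent *ext, size_t n, uint32_t id)
{
	assert(ext != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint32_t first = ext[i].first;

		/* Unsigned subtraction avoids overflow in first + count. */
		if (id >= first && id - first < ext[i].count)
			return ext[i].lower_first + (id - first);
	}
	return INVALID_ID;
}

/* Translates an id from the lower id space back into the namespace. */
static uint32_t map_id_up(const struct id_extent *ext, size_t n, uint32_t id)
{
	assert(ext != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint32_t lower = ext[i].lower_first;

		if (id >= lower && id - lower < ext[i].count)
			return ext[i].first + (id - lower);
	}
	return INVALID_ID;
}
```

    With a single extent { 0, 100000, 65536 }, local id 1000 maps down to
    101000, and any id outside the covered range has no mapping at all.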

    Performance:

    In this scheme for the permission checks the performance is expected to
    stay the same as the actual machine instructions should remain the same.

    The worst case I could think of is ls -l on a large directory where
    all of the stat results need to be translated from kuids and
    kgids to uids and gids. So I benchmarked that case on my laptop
    with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.

    My benchmark consisted of going to single user mode where nothing else
    was running. On an ext4 filesystem opening 1,000,000 files and looping
    through all of the files 1000 times and calling fstat on the
    individual files. This was to ensure I was benchmarking stat times
    where the inodes were in the kernel's cache, but the inode values were
    not in the processor's cache. My results:

    v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
    v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
    v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

    All of the configurations ran in roughly 120ns when I performed tests
    that ran in the cpu cache.

    So in summary the performance impact is:
    1ns improvement in the worst case with user namespace support compiled out.
    8ns aka 5% slowdown in the worst case with user namespace support compiled in.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • - Transform userns->creator from a user_struct reference to a simple
    kuid_t, kgid_t pair.

    In cap_capable this allows the check to see if we are the creator of
    a namespace to become the classic suser style euid permission check.

    This allows us to remove the need for a struct cred in the mapping
    functions and still be able to display the user namespace creator's
    uid and gid as 0.

    - Remove the now unnecessary delayed_work in free_user_ns.

    All that is left for free_user_ns to do is to call kmem_cache_free
    and put_user_ns. Those functions can be called in any context
    so call them directly from free_user_ns removing the need for delayed work.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

08 Apr, 2012

2 commits

  • Modify alloc_uid to take a kuid and make the user hash table global.
    Stop holding a reference to the user namespace in struct user_struct.

    This simplifies the code and makes the per user accounting not
    care about which user namespace a uid happens to appear in.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • With a user_ns reference in struct cred the only user of the user namespace
    reference in struct user_struct is to keep the uid hash table alive.

    The user_namespace reference in struct user_struct will be going away soon, and
    I have removed all of the references. Rename the field from user_ns to _user_ns
    so that the compiler can verify nothing follows the user struct to the user
    namespace anymore.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

24 Mar, 2011

1 commit

  • The expected course of development for user namespaces targeted
    capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.

    Goals:

    - Make it safe for an unprivileged user to unshare namespaces. They
    will be privileged with respect to the new namespace, but this should
    only include resources which the unprivileged user already owns.

    - Provide separate limits and accounting for userids in different
    namespaces.

    Status:

    Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
    get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
    CAP_SETGID capabilities. What this gets you is a whole new set of
    userids, meaning that user 500 will have a different 'struct user' in
    your namespace than in other namespaces. So any accounting information
    stored in struct user will be unique to your namespace.

    However, throughout the kernel there are checks which

    - simply check for a capability. Since root in a child namespace
    has all capabilities, this means that a child namespace is not
    constrained.

    - simply compare uid1 == uid2. Since these are the integer uids,
    uid 500 in namespace 1 will be said to be equal to uid 500 in
    namespace 2.

    As a result, the lxc implementation at lxc.sf.net does not use user
    namespaces. This is actually helpful because it leaves us free to
    develop user namespaces in such a way that, for some time, user
    namespaces may be unuseful.

    Bugs aside, this patchset is supposed to not at all affect systems which
    are not actively using user namespaces, and only restrict what tasks in
    child user namespace can do. They begin to limit privilege to a user
    namespace, so that root in a container cannot kill or ptrace tasks in the
    parent user namespace, and can only get world access rights to files.
    Since all files currently belong to the initial user namespace, that means
    that child user namespaces can only get world access rights to *all*
    files. While this temporarily makes user namespaces bad for system
    containers, it starts to get useful for some sandboxing.

    I've run the 'runltplite.sh' with and without this patchset and found no
    difference.

    This patch:

    copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
    So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
    will have the new user namespace as its owner. That is what we want,
    since we want root in that new userns to be able to have privilege over
    it.

    Changelog:
    Feb 15: don't set uts_ns->user_ns if we didn't create
    a new uts_ns.
    Feb 23: Move extern init_user_ns declaration from
    init/version.c to utsname.h.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

30 Dec, 2010

1 commit

  • When racing to add an entry to the user cache, the newly allocated
    entry (from the mm slab) is freed without putting the user namespace.

    Since a reference on the user namespace has already been taken by a
    get, a matching put has to be issued.

    Signed-off-by: Hillf Danton
    Acked-by: Serge Hallyn
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

27 Oct, 2010

1 commit

  • free_user() releases uidhash_lock but was missing annotation. Add it.
    This removes following sparse warnings:

    include/linux/spinlock.h:339:9: warning: context imbalance in 'free_user' - unexpected unlock
    kernel/user.c:120:6: warning: context imbalance in 'free_uid' - wrong count at exit

    Signed-off-by: Namhyung Kim
    Cc: Ingo Molnar
    Cc: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     

10 May, 2010

1 commit

  • This comment should have been removed together with uids_mutex
    when removing user sched.

    Signed-off-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Dhaval Giani
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     

03 Apr, 2010

1 commit


16 Mar, 2010

1 commit


21 Jan, 2010

1 commit

  • Remove the USER_SCHED feature. It has been scheduled to be removed in
    2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2

    Signed-off-by: Dhaval Giani
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     

02 Nov, 2009

1 commit

  • Ingo triggered the following warning:

    WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50()
    Hardware name: System Product Name
    ODEBUG: init active object type: timer_list
    Modules linked in:
    Pid: 2619, comm: dmesg Tainted: G W 2.6.32-rc5-tip+ #5298
    Call Trace:
    [] warn_slowpath_common+0x6a/0x81
    [] ? debug_print_object+0x42/0x50
    [] warn_slowpath_fmt+0x29/0x2c
    [] debug_print_object+0x42/0x50
    [] __debug_object_init+0x279/0x2d7
    [] debug_object_init+0x13/0x18
    [] init_timer_key+0x17/0x6f
    [] free_uid+0x50/0x6c
    [] put_cred_rcu+0x61/0x72
    [] rcu_do_batch+0x70/0x121

    debugobjects warns about an enqueued timer being initialized. If
    CONFIG_USER_SCHED=y the user management code uses delayed work to
    remove the user from the hash table and tear down the sysfs objects.

    free_uid is called from RCU and initializes/schedules delayed work if
    the usage count of the user_struct is 0. The init/schedule happens
    outside of the uidhash_lock protected region which allows a concurrent
    caller of find_user() to reference the about to be destroyed
    user_struct w/o preventing the work from being scheduled. If the next
    free_uid call happens before the work timer expired then the active
    timer is initialized and the work scheduled again.

    The race was introduced in commit 5cb350ba (sched: group scheduling,
    sysfs tunables) and made more prominent by commit 3959214f (sched:
    delayed cleanup of user_struct)

    Move the init/schedule_delayed_work inside of the uidhash_lock
    protected region to prevent the race.

    Signed-off-by: Thomas Gleixner
    Acked-by: Dhaval Giani
    Cc: Paul E. McKenney
    Cc: Kay Sievers
    Cc: stable@kernel.org

    Thomas Gleixner
     

16 Jun, 2009

1 commit

  • During bootup performance tracing we see repeated occurrences of
    /sys/kernel/uid/* events for the same uid, leading to a,
    in this case, rather pointless userspace processing for the
    same uid over and over.

    This is usually caused by tools which change their uid to "nobody",
    to run without privileges to read data supplied by untrusted users.

    This change delays the execution of the (already existing) scheduled
    work, to cleanup the uid after one second, so the allocated and announced
    uid can possibly be re-used by another process.

    This is the current behavior, where almost every invocation of a
    binary, which changes the uid, creates two events:
    $ read START < /sys/kernel/uevent_seqnum; \
    for i in `seq 100`; do su --shell=/bin/true bin; done; \
    read END < /sys/kernel/uevent_seqnum; \
    echo $(($END - $START))
    178

    With the delayed cleanup, we get only two events, and userspace finishes
    a bit faster too:
    $ read START < /sys/kernel/uevent_seqnum; \
    for i in `seq 100`; do su --shell=/bin/true bin; done; \
    read END < /sys/kernel/uevent_seqnum; \
    echo $(($END - $START))
    1

    Acked-by: Dhaval Giani
    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

24 Mar, 2009

1 commit


11 Mar, 2009

1 commit


27 Feb, 2009

2 commits

  • Impact: fix hung task with certain (non-default) rt-limit settings

    Corey Hickey reported that on using setuid to change the uid of a
    rt process, the process would be unkillable and not be running.
    This is because there was no rt runtime for that user group. Add
    in a check to see if a user can attach an rt task to its task group.
    On failure, return EINVAL, which is also returned in
    CONFIG_CGROUP_SCHED.

    Reported-by: Corey Hickey
    Signed-off-by: Dhaval Giani
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Dhaval Giani
     
  • per-uid keys were looked up by uid only. Use the user namespace
    to distinguish the same uid in different namespaces.

    This does not address key_permission. So a task can for instance
    try to join a keyring owned by the same uid in another namespace.
    That will be handled by a separate patch.

    Signed-off-by: Serge E. Hallyn
    Acked-by: David Howells
    Signed-off-by: James Morris

    Serge E. Hallyn
     

14 Feb, 2009

1 commit

  • uids in namespaces other than init don't get a sysfs entry.

    For those in the init namespace, while we're waiting to remove
    the sysfs entry for the uid the uid is still hashed, and
    alloc_uid() may re-grab that uid without getting a new
    reference to the user_ns, which we've already put in free_user
    before scheduling remove_user_sysfs_dir().

    Reported-and-tested-by: KOSAKI Motohiro
    Signed-off-by: Serge E. Hallyn
    Acked-by: David Howells
    Tested-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

29 Dec, 2008

1 commit

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
    sched: fix warning in fs/proc/base.c
    schedstat: consolidate per-task cpu runtime stats
    sched: use RCU variant of list traversal in for_each_leaf_rt_rq()
    sched, cpuacct: export percpu cpuacct cgroup stats
    sched, cpuacct: refactoring cpuusage_read / cpuusage_write
    sched: optimize update_curr()
    sched: fix wakeup preemption clock
    sched: add missing arch_update_cpu_topology() call
    sched: let arch_update_cpu_topology indicate if topology changed
    sched: idle_balance() does not call load_balance_newidle()
    sched: fix sd_parent_degenerate on non-numa smp machine
    sched: add uid information to sched_debug for CONFIG_USER_SCHED
    sched: move double_unlock_balance() higher
    sched: update comment for move_task_off_dead_cpu
    sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
    sched/rt: removed unneeded defintion
    sched: add hierarchical accounting to cpu accounting controller
    sched: include group statistics in /proc/sched_debug
    sched: rename SCHED_NO_NO_OMIT_FRAME_POINTER => SCHED_OMIT_FRAME_POINTER
    sched: clean up SCHED_CPUMASK_ALLOC
    ...

    Linus Torvalds
     

09 Dec, 2008

1 commit

  • Documented the currently bogus state of support for CFS user groups with
    user namespaces. In particular, all users in a user namespace should be
    children of the user which created the user namespace. This is yet to
    be implemented.

    Signed-off-by: Serge E. Hallyn
    Acked-by: Dhaval Giani

    Signed-off-by: Serge E. Hallyn
    Signed-off-by: James Morris

    Serge E. Hallyn
     

08 Dec, 2008

1 commit

  • (These two patches are in the next-unacked branch of
    git://git.kernel.org/pub/scm/linux/kernel/git/sergeh/userns-2.6.
    If they get some ACKs, then I hope to feed this into security-next.
    After these two, I think we're ready to tackle userns+capabilities)

    Fairsched creates a per-uid directory under /sys/kernel/uids/.
    So when you clone(CLONE_NEWUSER), it tries to create
    /sys/kernel/uids/0, which already exists, and you get back
    -ENOMEM.

    This was supposed to be fixed by sysfs tagging, but that
    was postponed (ok, rejected until sysfs locking is fixed).
    So, just as with network namespaces, we just don't create
    those directories for user namespaces other than the init.

    Signed-off-by: Serge E. Hallyn
    Signed-off-by: James Morris

    Serge E. Hallyn
     

02 Dec, 2008

1 commit


25 Nov, 2008

2 commits

  • Fix up the last current_user()->user_ns instance to use
    current_user_ns().

    Signed-off-by: Serge E. Hallyn

    Serge Hallyn
     
  • The user_ns is moved from nsproxy to user_struct, so that a struct
    cred by itself is sufficient to determine access (which it otherwise
    would not be). Corresponding ecryptfs fixes (by David Howells) are
    here as well.

    Fix refcounting. The following rules now apply:
    1. The task pins the user struct.
    2. The user struct pins its user namespace.
    3. The user namespace pins the struct user which created it.

    User namespaces are cloned during copy_creds(). Unsharing a new user_ns
    is no longer possible. (We could re-add that, but it'll cause code
    duplication and doesn't seem useful if PAM doesn't need to clone user
    namespaces).

    When a user namespace is created, its first user (uid 0) gets empty
    keyrings and a clean group_info.

    This incorporates a previous patch by David Howells. Here
    is his original patch description:

    >I suggest adding the attached incremental patch. It makes the following
    >changes:
    >
    > (1) Provides a current_user_ns() macro to wrap accesses to current's user
    > namespace.
    >
    > (2) Fixes eCryptFS.
    >
    > (3) Renames create_new_userns() to create_user_ns() to be more consistent
    > with the other associated functions and because the 'new' in the name is
    > superfluous.
    >
    > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
    > beginning of do_fork() so that they're done prior to making any attempts
    > at allocation.
    >
    > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
    > to fill in rather than have it return the new root user. I don't imagine
    > the new root user being used for anything other than filling in a cred
    > struct.
    >
    > This also permits me to get rid of a get_uid() and a free_uid(), as the
    > reference the creds were holding on the old user_struct can just be
    > transferred to the new namespace's creator pointer.
    >
    > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
    > preparation rather than doing it in copy_creds().
    >
    >David

    >Signed-off-by: David Howells

    Changelog:
    Oct 20: integrate dhowells comments
    1. leave thread_keyring alone
    2. use current_user_ns() in set_user()

    Signed-off-by: Serge Hallyn

    Serge Hallyn
     

14 Nov, 2008

2 commits

  • Inaugurate copy-on-write credentials management. This uses RCU to manage the
    credentials pointer in the task_struct with respect to accesses by other tasks.
    A process may only modify its own credentials, and so does not need locking to
    access or modify its own credentials.

    A mutex (cred_replace_mutex) is added to the task_struct to control the effect
    of PTRACE_ATTACHED on credential calculations, particularly with respect to
    execve().

    With this patch, the contents of an active credentials struct may not be
    changed directly; rather a new set of credentials must be prepared, modified
    and committed using something like the following sequence of events:

    struct cred *new = prepare_creds();
    int ret = blah(new);
    if (ret < 0) {
            abort_creds(new);
            return ret;
    }
    return commit_creds(new);
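
    The prepare/modify/commit sequence can be modelled in userspace to show
    why published credentials stay unchanged. The sketch below is only an
    illustration: every name ending in _model is hypothetical, there is a
    single modelled task, and plain integers replace the RCU and locking
    machinery that the real implementation depends on.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified, hypothetical model of a credentials struct. */
struct cred_model {
	int usage; /* reference count */
	unsigned int uid;
	unsigned int gid;
};

/* Published credentials of the single modelled task (must be set first). */
static const struct cred_model *current_cred_model;

/* Only the reference count may change through a const pointer. */
static const struct cred_model *get_cred_model(const struct cred_model *c)
{
	((struct cred_model *)c)->usage++;
	return c;
}

static void put_cred_model(const struct cred_model *c)
{
	struct cred_model *m = (struct cred_model *)c;

	if (--m->usage == 0)
		free(m);
}

/* Returns a private, writable copy of the current credentials. */
static struct cred_model *prepare_creds_model(void)
{
	struct cred_model *new = malloc(sizeof(*new));

	if (!new)
		return NULL;
	*new = *current_cred_model;
	new->usage = 1;
	return new;
}

/* Discards a prepared copy that was never published. */
static void abort_creds_model(struct cred_model *new)
{
	free(new);
}

/* Publishes the new copy and drops the task's reference on the old one. */
static int commit_creds_model(struct cred_model *new)
{
	const struct cred_model *old = current_cred_model;

	assert(new != NULL);
	current_cred_model = new;
	put_cred_model(old);
	return 0;
}

/* Changes the uid by preparing, modifying and committing a replacement. */
static int set_uid_model(unsigned int uid)
{
	struct cred_model *new = prepare_creds_model();

	if (!new)
		return -1;
	if (uid == (unsigned int)-1) { /* treat (unsigned int)-1 as invalid */
		abort_creds_model(new);
		return -1;
	}
	new->uid = uid;
	return commit_creds_model(new);
}
```

    A holder of a reference to the old credentials keeps seeing the old
    values after the commit, because the change was made on a copy.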

    There are some exceptions to this rule: the keyrings pointed to by the active
    credentials may be instantiated - keyrings violate the COW rule as managing
    COW keyrings is tricky, given that it is possible for a task to directly alter
    the keys in a keyring in use by another task.

    To help enforce this, various pointers to sets of credentials, such as those in
    the task_struct, are declared const. The purpose of this is compile-time
    discouragement of altering credentials through those pointers. Once a set of
    credentials has been made public through one of these pointers, it may not be
    modified, except under special circumstances:

    (1) Its reference count may be incremented and decremented.

    (2) The keyrings to which it points may be modified, but not replaced.

    The only safe way to modify anything else is to create a replacement and commit
    using the functions described in Documentation/credentials.txt (which will be
    added by a later patch).

    This patch and the preceding patches have been tested with the LTP SELinux
    testsuite.

    This patch makes several logical sets of alteration:

    (1) execve().

    This now prepares and commits credentials in various places in the
    security code rather than altering the current creds directly.

    (2) Temporary credential overrides.

    do_coredump() and sys_faccessat() now prepare their own credentials and
    temporarily override the ones currently on the acting thread, whilst
    preventing interference from other threads by holding cred_replace_mutex
    on the thread being dumped.

    This will be replaced in a future patch by something that hands down the
    credentials directly to the functions being called, rather than altering
    the task's objective credentials.

    (3) LSM interface.

    A number of functions have been changed, added or removed:

    (*) security_capset_check(), ->capset_check()
    (*) security_capset_set(), ->capset_set()

    Removed in favour of security_capset().

    (*) security_capset(), ->capset()

    New. This is passed a pointer to the new creds, a pointer to the old
    creds and the proposed capability sets. It should fill in the new
    creds or return an error. All pointers, barring the pointer to the
    new creds, are now const.

    (*) security_bprm_apply_creds(), ->bprm_apply_creds()

    Changed; now returns a value, which will cause the process to be
    killed if it's an error.

    (*) security_task_alloc(), ->task_alloc_security()

    Removed in favour of security_prepare_creds().

    (*) security_cred_free(), ->cred_free()

    New. Free security data attached to cred->security.

    (*) security_prepare_creds(), ->cred_prepare()

    New. Duplicate any security data attached to cred->security.

    (*) security_commit_creds(), ->cred_commit()

    New. Apply any security effects for the upcoming installation of new
    security by commit_creds().

    (*) security_task_post_setuid(), ->task_post_setuid()

    Removed in favour of security_task_fix_setuid().

    (*) security_task_fix_setuid(), ->task_fix_setuid()

    Fix up the proposed new credentials for setuid(). This is used by
    cap_set_fix_setuid() to implicitly adjust capabilities in line with
    setuid() changes. Changes are made to the new credentials, rather
    than the task itself as in security_task_post_setuid().

    (*) security_task_reparent_to_init(), ->task_reparent_to_init()

    Removed. Instead the task being reparented to init is referred
    directly to init's credentials.

    NOTE! This results in the loss of some state: SELinux's osid no
    longer records the sid of the thread that forked it.

    (*) security_key_alloc(), ->key_alloc()
    (*) security_key_permission(), ->key_permission()

    Changed. These now take cred pointers rather than task pointers to
    refer to the security context.

    (4) sys_capset().

    This has been simplified and uses less locking. The LSM functions it
    calls have been merged.

    (5) reparent_to_kthreadd().

    This gives the current thread the same credentials as init by simply using
    commit_thread() to point that way.

    (6) __sigqueue_alloc() and switch_uid()

    __sigqueue_alloc() can't stop the target task from changing its creds
    beneath it, so this function gets a reference to the currently applicable
    user_struct which it then passes into the sigqueue struct it returns if
    successful.

    switch_uid() is now called from commit_creds(), and possibly should be
    folded into that. commit_creds() should take care of protecting
    __sigqueue_alloc().

    (7) [sg]et[ug]id() and co and [sg]et_current_groups.

    The set functions now all use prepare_creds(), commit_creds() and
    abort_creds() to build and check a new set of credentials before applying
    it.

    security_task_set[ug]id() is called inside the prepared section. This
    guarantees that nothing else will affect the creds until we've finished.

    The calling of set_dumpable() has been moved into commit_creds().

    Much of the functionality of set_user() has been moved into
    commit_creds().

    The get functions all simply access the data directly.
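
    The prepare/commit/abort sequence described above can be sketched as a
    small userspace model. This is not the kernel's implementation: every
    model_* name is hypothetical, the reference counting is not atomic, and
    the permission rule inside model_setuid() is a stand-in chosen only to
    give the abort path something to do.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_cred {
    int usage;              /* reference count (plain int, not atomic) */
    unsigned int uid;
    unsigned int euid;
};

struct model_task {
    struct model_cred *cred;    /* current creds, treated as read-only */
};

/* Copy the task's current credentials into a private, modifiable set. */
struct model_cred *model_prepare_creds(const struct model_task *task)
{
    struct model_cred *new = malloc(sizeof(*new));

    if (!new)
        return NULL;
    *new = *task->cred;
    new->usage = 1;
    return new;
}

/* Drop one reference; free the creds when the last one goes away. */
void model_put_cred(struct model_cred *cred)
{
    assert(cred->usage > 0);
    if (--cred->usage == 0)
        free(cred);
}

/* Install the new set and drop the task's reference on the old one. */
void model_commit_creds(struct model_task *task, struct model_cred *new)
{
    struct model_cred *old = task->cred;

    task->cred = new;
    model_put_cred(old);
}

/* Discard a prepared set without touching the task. */
void model_abort_creds(struct model_cred *new)
{
    model_put_cred(new);
}

/* A setuid-like operation: 0 on success, -1 on failure. */
int model_setuid(struct model_task *task, unsigned int uid)
{
    struct model_cred *new = model_prepare_creds(task);

    if (!new)
        return -1;

    /* Stand-in permission check, made before anything is applied. */
    if (task->cred->euid != 0 && uid != task->cred->uid) {
        model_abort_creds(new);
        return -1;
    }

    new->uid = uid;
    new->euid = uid;
    model_commit_creds(task, new);
    return 0;
}
```

    The point of the shape is that a failed check leaves the task's existing
    credentials untouched: only the private copy is thrown away.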

    (8) security_task_prctl() and cap_task_prctl().

    security_task_prctl() has been modified to return -ENOSYS if it doesn't
    want to handle a function, or otherwise return the return value directly
    rather than through an argument.

    Additionally, cap_task_prctl() now prepares a new set of credentials, even
    if it doesn't end up using it.
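
    The -ENOSYS convention described for security_task_prctl() can be
    illustrated with a userspace sketch. The option numbers, the model_*
    function names and the return values below are all hypothetical; only the
    "return -ENOSYS to decline, otherwise return the result directly" contract
    is taken from the text above.

```c
#include <errno.h>

/* Hypothetical option numbers, for illustration only. */
enum { MODEL_OPT_GET_FLAG = 1, MODEL_OPT_OTHER = 2 };

/*
 * Models a security hook: returns -ENOSYS for options it does not want to
 * handle, otherwise returns the result (or a negative errno) directly.
 */
int model_security_prctl(int option, unsigned long arg)
{
    switch (option) {
    case MODEL_OPT_GET_FLAG:
        return arg ? 1 : 0;
    default:
        return -ENOSYS;
    }
}

/* Models the generic path that consults the hook first. */
int model_prctl(int option, unsigned long arg)
{
    int ret = model_security_prctl(option, arg);

    if (ret != -ENOSYS)
        return ret;     /* the hook handled it; pass its value through */

    /* The hook declined, so fall back to the generic handling. */
    switch (option) {
    case MODEL_OPT_OTHER:
        return 42;
    default:
        return -EINVAL;
    }
}
```

    Returning the value directly means the caller needs no extra output
    argument to distinguish "handled" from "not handled".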

    (9) Keyrings.

    A number of changes have been made to the keyrings code:

    (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
    all been dropped and built in to the credentials functions directly.
    They may want separating out again later.

    (b) key_alloc() and search_process_keyrings() now take a cred pointer
    rather than a task pointer to specify the security context.

    (c) copy_creds() gives a new thread within the same thread group a new
    thread keyring if its parent had one, otherwise it discards the thread
    keyring.

    (d) The authorisation key now points directly to the credentials to extend
    the search into rather than pointing to the task that carries them.

    (e) Installing thread, process or session keyrings causes a new set of
    credentials to be created, even though it's not strictly necessary for
    process or session keyrings (they're shared).

    (10) Usermode helper.

    The usermode helper code now carries a cred struct pointer in its
    subprocess_info struct instead of a new session keyring pointer. This set
    of credentials is derived from init_cred and installed on the new process
    after it has been cloned.

    call_usermodehelper_setup() allocates the new credentials and
    call_usermodehelper_freeinfo() discards them if they haven't been used. A
    special cred function (prepare_usermodeinfo_creds()) is provided
    specifically for call_usermodehelper_setup() to call.

    call_usermodehelper_setkeys() adjusts the credentials to sport the
    supplied keyring as the new session keyring.

    (11) SELinux.

    SELinux has a number of changes, in addition to those to support the LSM
    interface changes mentioned above:

    (a) selinux_setprocattr() no longer does its check for whether the
    current ptracer can access processes with the new SID inside the lock
    that covers getting the ptracer's SID. Whilst this lock ensures that
    the check is done with the ptracer pinned, the result is only valid
    until the lock is released, so there's no point doing it inside the
    lock.

    (12) is_single_threaded().

    This function has been extracted from selinux_setprocattr() and put into
    a file of its own in the lib/ directory as join_session_keyring() now
    wants to use it too.

    The code in SELinux just checked to see whether a task shared mm_structs
    with other tasks (CLONE_VM), but that isn't good enough. We really want
    to know if they're part of the same thread group (CLONE_THREAD).
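
    The difference between the two tests can be shown with a userspace model.
    The structure and function names are hypothetical and carry only the two
    fields needed for the comparison; this is a sketch of the distinction
    drawn above, not the code added under lib/.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task; not the kernel's task_struct. */
struct model_task {
    int tgid;        /* thread group ID, shared by CLONE_THREAD siblings */
    const void *mm;  /* address-space identity, shared by CLONE_VM tasks */
};

/* The mm-based test: does any other task share our address space? */
bool model_shares_mm(const struct model_task *tasks, size_t n, size_t self)
{
    assert(self < n);
    for (size_t i = 0; i < n; i++)
        if (i != self && tasks[i].mm == tasks[self].mm)
            return true;
    return false;
}

/* The thread-group test: is no other task in our thread group? */
bool model_is_single_threaded(const struct model_task *tasks, size_t n,
                              size_t self)
{
    assert(self < n);
    for (size_t i = 0; i < n; i++)
        if (i != self && tasks[i].tgid == tasks[self].tgid)
            return false;
    return true;
}
```

    Two tasks that share an address space but have different thread group IDs
    are reported as sharing by the first test, yet each is still
    single-threaded according to the second.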

    (13) nfsd.

    The NFS server daemon now has to use the COW credentials to set the
    credentials it is going to use. It really needs to pass the credentials
    down to the functions it calls, but it can't do that until other patches
    in this series have been applied.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     

19 Aug, 2008

1 commit


30 Apr, 2008

1 commit

  • Use kmem_cache_zalloc(), remove large amounts of initialisation code and
    ifdeffery.

    Note: this assumes that memset(*atomic_t, 0) correctly initialises the
    atomic_t. This is true for all present architectures and if it becomes false
    for a future architecture then we'll need to make large changes all over the
    place anyway.
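
    A userspace sketch of the same idea, with calloc() standing in for
    kmem_cache_zalloc(). The model_atomic_t type is a hypothetical stand-in
    that assumes the counter is a plain int wrapped in a struct; the object
    layout and names are invented for illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for atomic_t: a struct wrapping an int counter. */
typedef struct { int counter; } model_atomic_t;

/* Hypothetical object with counters that should start at zero. */
struct model_object {
    model_atomic_t usage;
    model_atomic_t pending;
    unsigned int id;
};

/*
 * Allocate a zero-filled object. Because the memory is already zeroed,
 * only fields with non-zero initial values need explicit setup.
 */
struct model_object *model_object_alloc(unsigned int id)
{
    struct model_object *obj = calloc(1, sizeof(*obj));

    if (!obj)
        return NULL;
    obj->id = id;
    return obj;
}
```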

    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

29 Apr, 2008

1 commit

  • Don't generate the per-UID user and user session keyrings unless they're
    explicitly accessed. This solves a problem during a login process whereby
    set*uid() is called before the SELinux PAM module, resulting in the per-UID
    keyrings having the wrong security labels.

    This also cures the problem of multiple per-UID keyrings sometimes appearing
    due to PAM modules (including pam_keyinit) setuiding and causing user_structs
    to come into and go out of existence whilst the session keyring pins the user
    keyring. This is achieved by first searching for extant per-UID keyrings
    before inventing new ones.
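
    The "search first, create only on explicit access" behaviour can be
    sketched in userspace as follows. The fixed-size table and all model_*
    names are hypothetical, and the sketch ignores locking and security
    labels entirely.

```c
#include <stddef.h>

#define MODEL_MAX_KEYRINGS 8

/* Hypothetical per-UID keyring record. */
struct model_keyring {
    unsigned int uid;
    int live;           /* non-zero once the slot holds a keyring */
};

/* Hypothetical registry; zero-initialised, so it starts out empty. */
struct model_keyring model_keyrings[MODEL_MAX_KEYRINGS];

/* Search for an extant keyring for this UID. */
struct model_keyring *model_find_user_keyring(unsigned int uid)
{
    for (size_t i = 0; i < MODEL_MAX_KEYRINGS; i++)
        if (model_keyrings[i].live && model_keyrings[i].uid == uid)
            return &model_keyrings[i];
    return NULL;
}

/*
 * Called only when the keyring is explicitly accessed: reuse an existing
 * keyring if there is one, otherwise create it now.
 */
struct model_keyring *model_lookup_user_keyring(unsigned int uid)
{
    struct model_keyring *keyring = model_find_user_keyring(uid);

    if (keyring)
        return keyring;

    for (size_t i = 0; i < MODEL_MAX_KEYRINGS; i++) {
        if (!model_keyrings[i].live) {
            model_keyrings[i].uid = uid;
            model_keyrings[i].live = 1;
            return &model_keyrings[i];
        }
    }
    return NULL;        /* table full */
}

/* Count the keyrings that currently exist. */
size_t model_keyring_count(void)
{
    size_t count = 0;

    for (size_t i = 0; i < MODEL_MAX_KEYRINGS; i++)
        if (model_keyrings[i].live)
            count++;
    return count;
}
```

    Repeated lookups for the same UID return the same record, so duplicates
    cannot appear, and nothing is created for a UID that is never looked up.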

    The serial bound argument is also dropped from find_keyring_by_name() as it's
    not currently made use of (setting it to 0 disables the feature).

    Signed-off-by: David Howells
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

20 Apr, 2008

2 commits