20 Nov, 2012

6 commits

  • The task_user_ns function hides the fact that it is getting the user
    namespace from struct cred on the task. struct cred may go away as
    soon as the rcu lock is released. This leads to a race where we
    can dereference a stale user namespace pointer.

    To make it obvious a struct cred is involved kill task_user_ns.

    To kill the race modify the users of task_user_ns to only
    reference the user namespace while the rcu lock is held.

    Cc: Kees Cook
    Cc: James Morris
    Acked-by: Kees Cook
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Modify create_new_namespaces to explicitly take a user namespace
    parameter, instead of implicitly through the task_struct.

    This allows an implementation of unshare(CLONE_NEWUSER) where
    the new user namespace is not stored onto the current task_struct
    until after all of the namespaces are created.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • - Push the permission check from the core setns syscall into
    the setns install methods where the user namespace of the
    target namespace can be determined, and used in a ns_capable
    call.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • If an unprivileged user has the appropriate capabilities in their
    current user namespace allow the creation of new namespaces.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • - Allow chown if CAP_CHOWN is present in the current user namespace
    and the uid of the inode maps into the current user namespace, and
    the destination uid or gid maps into the current user namespace.

    - Allow preserving setgid when changing an inode if CAP_FSETID is
    present in the current user namespace and the owner of the file has
    a mapping into the current user namespace.

    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

19 Nov, 2012

22 commits

  • Now that we have been through every permission check in the kernel
    having uid == 0 and gid == 0 in your local user namespace no
    longer adds any special privileges. Even having a full set
    of caps in your local user namespace is safe because capabilities
    are relative to your local user namespace, and do not confer
    unexpected privileges.

    Over the long term this should allow much more of the kernel's
    functionality to be safely used by non-root users, including
    functionality like unsharing the mount namespace that is only unsafe
    because it can fool applications whose privileges are raised when
    they are executed. Since those applications have no privileges in
    a user namespace it becomes safe to spoof and confuse those
    applications all you want.

    Those capabilities will still need to be enabled carefully because
    we may still need things like rlimits on the number of unprivileged
    mounts, but that is to avoid DoS attacks, not to avoid fooling root
    owned processes.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • When performing an exec where the binary lives in one user namespace and
    the execing process lives in another user namespace there is the possibility
    that the target uids can not be represented.

    Instead of failing the exec simply ignore the suid/sgid bits and run
    the binary with lower privileges. We already do this in the case
    of MNT_NOSUID so this should be a well tested code path.

    As the user and group are not changed this should not introduce any
    security issues.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Change return value from -EINVAL to -EPERM when the permission check fails.

    Signed-off-by: Zhao Hongjiang
    Signed-off-by: Eric W. Biederman

    Zhao Hongjiang
     
  • - Add a filesystem flag to mark filesystems that are safe to mount as
    an unprivileged user.

    - Add a filesystem flag to mark filesystems that don't need MNT_NODEV
    when mounted by an unprivileged user.

    - Relax the permission checks to allow unprivileged users that have
    CAP_SYS_ADMIN permissions in the user namespace referred to by the
    current mount namespace to be allowed to mount, unmount, and move
    filesystems.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Sharing mount subtrees with mount namespaces created by unprivileged
    users allows unprivileged mounts created by unprivileged users to
    propagate to mount namespaces controlled by privileged users.

    Prevent nasty consequences by changing shared subtrees to slave
    subtrees when an unprivileged user creates a new mount namespace.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • This will allow for support for unprivileged mounts in a new user namespace.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • setns support for the mount namespace is a little tricky as an
    arbitrary decision must be made about what to set fs->root and
    fs->pwd to, as there is no expectation of a relationship between
    the two mount namespaces. Therefore I arbitrarily find the root
    mount point, and follow every mount on top of it to find the top
    of the mount stack. Then I set fs->root and fs->pwd to that
    location. The topmost root of the mount stack seems like a
    reasonable place to be.

    Bind mount support for the mount namespace inodes has the
    possibility of creating circular dependencies between mount
    namespaces. Circular dependencies can result in loops that
    prevent mount namespaces from ever being freed. I avoid
    creating those circular dependencies by adding a sequence number
    to the mount namespace and require all bind mounts be of a
    younger mount namespace into an older mount namespace.

    Add a helper function proc_ns_inode so it is possible to
    detect when we are attempting to bind mount a namespace inode.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
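
The sequence-number rule described in this commit can be modelled in plain C. This is only a userspace sketch: struct mnt_ns_model, mnt_ns_create and bind_would_loop are hypothetical names chosen for illustration, not the kernel's implementation. Each namespace is stamped with a sequence number at creation, and a bind mount of a namespace inode is refused unless the pinned namespace is strictly younger than the namespace that hosts the mount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a mount namespace with a creation stamp. */
struct mnt_ns_model {
    unsigned long long seq; /* larger value means created later (younger) */
};

/* Create a namespace, stamping it with the next sequence number. */
static inline struct mnt_ns_model mnt_ns_create(unsigned long long *next_seq)
{
    assert(next_seq != NULL);
    struct mnt_ns_model ns = { (*next_seq)++ };
    return ns;
}

/*
 * True when bind mounting the inode of `pinned` inside `host` must be
 * refused: only a strictly younger namespace may be pinned by an older one.
 */
static inline bool bind_would_loop(const struct mnt_ns_model *host,
                                   const struct mnt_ns_model *pinned)
{
    return host->seq >= pinned->seq;
}
```

Because every permitted reference points from a lower sequence number to a strictly higher one, following references can never arrive back at the starting namespace, so no cycle can form.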
     
  • Once you are confined to a user namespace applications can not gain
    privilege and escape the user namespace so there is no longer a reason
    to restrict chroot.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Unsharing of the pid namespace, unlike unsharing of other namespaces,
    does not take effect immediately. Instead it affects the children
    created with fork and clone. The first of these children becomes the init
    process of the new pid namespace, the rest become oddball children
    of pid 0. From the point of view of the new pid namespace the process
    that created it is pid 0, as its pid does not map.

    A couple of different semantics were considered but this one was
    settled on because it is easy to implement and it is usable from
    pam modules, the core reason for the existence of unshare.

    I took a survey of the callers of pam modules and the following
    appears to be a representative sample of their logic.
    {
            setup stuff include pam
            child = fork();
            if (!child) {
                    setuid()
                    exec /bin/bash
            }
            waitpid(child);

            pam and other cleanup
    }

    As you can see there is a fork to create the unprivileged user
    space process, which means that the unprivileged user space
    process will appear as pid 1 in the new pid namespace. Further,
    most login processes do not cope with extraneous children, which
    means shifting the duty of reaping extraneous child processes to
    the creator of those extraneous children makes the system more
    comprehensible.

    The practical reason for this set of pid namespace semantics is
    that it is simple to implement and to verify that it works correctly,
    whereas an implementation that requires changing the struct
    pid on a process comes with a lot more races and pain, not
    the least of which is that glibc caches getpid().

    These semantics are implemented by having two notions
    of the pid namespace of a process. There is task_active_pid_ns,
    which is the pid namespace the process was created with
    and the pid namespace that all pids are presented to
    that process in. The task_active_pid_ns is stored
    in the struct pid of the task.

    Then there is the pid namespace that will be used for children;
    that pid namespace is stored in task->nsproxy->pid_ns.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
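
The semantics described in this commit can be observed from user space on a kernel that supports unshare(CLONE_NEWPID). The following is a minimal sketch, assuming Linux with glibc and enough privilege to create a pid namespace; first_child_is_init is a hypothetical helper name. It performs the unshare in a throwaway helper process so the caller's own later children are not affected, then reports whether the first child forked afterwards sees itself as pid 1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the first child forked after unshare(CLONE_NEWPID) saw
 * itself as pid 1, 0 if it did not, and -1 if the experiment could not
 * be run (for example because unshare() failed for lack of privilege).
 */
static inline int first_child_is_init(void)
{
    int status;
    pid_t helper = fork();

    if (helper < 0)
        return -1;
    if (helper == 0) {
        /* Helper: its own pid is unchanged by the unshare. */
        if (unshare(CLONE_NEWPID) != 0)
            _exit(2);
        pid_t child = fork();
        if (child < 0)
            _exit(2);
        if (child == 0)
            _exit(getpid() == 1 ? 0 : 1); /* first child after the unshare */
        if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
            _exit(2);
        _exit(WEXITSTATUS(status));
    }
    if (waitpid(helper, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:
        return 1;
    case 1:
        return 0;
    default:
        return -1;
    }
}
```

The helper that calls unshare keeps its original pid; only the child it forks afterwards is placed in the new pid namespace, which matches the fork-then-exec login pattern surveyed in the commit message.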
     
  • Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
    for the system init process, and another way for pid namespace
    init processes, test pid->nr == 1 and use the same code for both.

    For the global init this results in SIGNAL_UNKILLABLE being set
    much earlier in the initialization process.

    This is a small cleanup and it paves the way for allowing unshare and
    enter of the pid namespace as that path like our global init also will
    not set CLONE_NEWPID.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • - Pid namespaces are designed to be inescapable so verify that the
    passed in pid namespace is a child of the currently active
    pid namespace or the currently active pid namespace itself.

    Allowing the currently active pid namespace is important so
    the effects of an earlier setns can be cancelled.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
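
The descendant-or-self check described in this commit can be sketched as a walk up the parent chain. This is a userspace model under stated assumptions: struct pidns_model and pidns_may_enter are hypothetical names, and each namespace is assumed to record its parent and its depth below the initial namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a pid namespace in the hierarchy. */
struct pidns_model {
    struct pidns_model *parent; /* NULL for the initial namespace */
    unsigned int level;         /* depth: 0 for the initial namespace */
};

/* True when `target` is `active` itself or one of its descendants. */
static inline bool pidns_may_enter(const struct pidns_model *active,
                                   const struct pidns_model *target)
{
    const struct pidns_model *ancestor = target;

    /* A shallower namespace can never be a descendant of `active`. */
    if (target->level < active->level)
        return false;

    /* Walk up from the target until we reach the depth of `active`. */
    while (ancestor->level > active->level) {
        assert(ancestor->parent != NULL);
        ancestor = ancestor->parent;
    }
    return ancestor == active;
}
```

Accepting the active namespace itself is what allows a task to cancel the effect of an earlier setns, as the commit message notes.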
     
  • task_active_pid_ns(current) != current->nsproxy->pid_ns will
    soon be allowed to support unshare and setns.

    The definition of creating a child pid namespace when
    task_active_pid_ns(current) != current->nsproxy->pid_ns could be that
    we create a child pid namespace of current->nsproxy->pid_ns. However
    that leads to strange cases like trying to have a single process be
    init in multiple pid namespaces, which is racy and hard to think
    about.

    The definition of creating a child pid namespace when
    task_active_pid_ns(current) != current->nsproxy->pid_ns could be that
    we create a child pid namespace of task_active_pid_ns(current). While
    that seems less racy it does not provide any utility.

    Therefore define the semantics of creating a child pid namespace when
    task_active_pid_ns(current) != current->nsproxy->pid_ns to be that the
    pid namespace creation fails. That is easy to implement and easy
    to think about.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Looking at pid_ns->nr_hashed is a bit simpler and it works for
    disjoint process trees that an unshare or a join of a pid_namespace
    may create.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Set nr_hashed to -1 just before we schedule the work to cleanup proc.
    Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
    fail.

    This guarantees that processes never enter a pid namespace after we
    have cleaned up the state to support processes in a pid namespace.

    Currently sending SIGKILL to all of the processes in a pid namespace as
    init exits gives us this guarantee but we need something a little
    stronger to support unsharing and joining a pid namespace.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
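
The nr_hashed guard described in this commit can be illustrated with a small userspace model. It is a single-threaded sketch only: struct pid_ns_model, model_hash_pid and model_unhash_pid are hypothetical names, and the locking the kernel would need around the counter is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a pid namespace's hash accounting. */
struct pid_ns_model {
    int nr_hashed;          /* live pids; -1 once teardown has begun */
    bool cleanup_scheduled; /* stands in for the queued proc cleanup work */
};

/* Try to hash a new pid; refuse once the namespace is being torn down. */
static inline bool model_hash_pid(struct pid_ns_model *ns)
{
    if (ns->nr_hashed < 0)
        return false;
    ns->nr_hashed++;
    return true;
}

/* Unhash a pid; when the last one goes, poison the counter, then schedule cleanup. */
static inline void model_unhash_pid(struct pid_ns_model *ns)
{
    assert(ns->nr_hashed > 0);
    if (--ns->nr_hashed == 0) {
        ns->nr_hashed = -1;
        ns->cleanup_scheduled = true;
    }
}
```

Once the counter has been set to -1 it never becomes non-negative again, so any later attempt to hash a pid into the namespace fails instead of racing with the cleanup.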
     
  • Track the number of pids in the proc hash table. When the number of
    pids goes to 0 schedule work to unmount the kernel mount of proc.

    Move the mount of proc into alloc_pid when we allocate the pid for
    init.

    Remove the surprising calls of pid_ns_release_proc in fork and
    proc_flush_task. Those code paths really shouldn't know about proc
    namespace implementation details and people have demonstrated several
    times that finding and understanding those code paths is difficult and
    non-obvious.

    Because of the call path detach_pid is always called with the
    rtnl_lock held, so free_pid is not allowed to sleep and the work of
    unmounting proc is moved to a work queue. This has the side benefit
    of not blocking the entire world waiting for the unnecessary
    rcu_barrier in deactivate_locked_super.

    In the process of making the code clear and obvious this fixes a bug
    reported by Gao feng where we would leak a
    mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
    succeeded and copy_net_ns failed.

    Acked-by: "Serge E. Hallyn"
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
    aka ns_of_pid(task_pid(tsk)) should have the same number of
    cache line misses with the practical difference that
    ns_of_pid(task_pid(tsk)) is released later in a process's life.

    Furthermore by using task_active_pid_ns it becomes trivial
    to write an unshare implementation for the pid namespace.

    So I have used task_active_pid_ns everywhere I can.

    In fork since the pid has not yet been attached to the
    process I use ns_of_pid, to achieve the same effect.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • - Capture the user namespace that creates the pid namespace
    - Use that user namespace to test if it is ok to write to
    /proc/sys/kernel/ns_last_pid.

    Zhao Hongjiang noticed I was missing a put_user_ns
    when destroying a pid_ns. I have folded his patch into this one
    so that bisects will work properly.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Now that we have s_fs_info pointing to our pid namespace
    the original reason for the proc root inode having a struct
    pid is gone.

    Caching a pid in the root inode has led to some complicated
    code. Now that we don't need the struct pid, just remove it.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • I had visions at one point of splitting proc into two filesystems. If
    that had happened, proc/self being the part of proc that actually deals
    with pids would have been a nice cleanup. As it is, proc/self requires
    a lot of unnecessary infrastructure for a single file.

    The only user visible change is that a mounted /proc for a pid namespace
    that is dead now shows a broken proc symlink, instead of being completely
    invisible. I don't think anyone will notice or care.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • The kbuild test robot reported the following error
    when building mips with user namespace support enabled.

    All error/warnings:
    arch/mips/kernel/mips-mt-fpaff.c: In function 'check_same_owner':
    arch/mips/kernel/mips-mt-fpaff.c:53:22: error: invalid operands to binary == (have 'kuid_t' and 'kuid_t')
    arch/mips/kernel/mips-mt-fpaff.c:54:15: error: invalid operands to binary == (have 'kuid_t' and 'kuid_t')

    Replacing "a == b" with uid_eq(a, b) removes this error and allows the
    code to work with user namespaces enabled.

    Cc: Ralf Baechle
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
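
The compiler error quoted in this commit arises because kuid_t is a struct wrapper and C defines no == operator on structs. The following userspace sketch shows the pattern; it mirrors the spirit of the kernel's type-safe wrapper, but same_owner is a hypothetical stand-in for illustration and is not the code in mips-mt-fpaff.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of a type-safe uid: a struct, so it cannot be mixed with plain integers. */
typedef struct {
    unsigned int val;
} kuid_t;

/* The wrapper should add no storage beyond the integer it holds. */
static_assert(sizeof(kuid_t) == sizeof(unsigned int), "kuid_t adds no storage");

/* Compare two kuid_t values; writing left == right would not compile. */
static inline bool uid_eq(kuid_t left, kuid_t right)
{
    return left.val == right.val;
}

/* Hypothetical ownership check: the caller's euid matches either target id. */
static inline bool same_owner(kuid_t caller_euid, kuid_t target_euid,
                              kuid_t target_uid)
{
    return uid_eq(caller_euid, target_euid) || uid_eq(caller_euid, target_uid);
}
```

Wrapping the id in a struct turns accidental mixing of kernel-internal ids with raw user-supplied integers into a compile-time error, at the cost of needing helpers such as uid_eq for comparisons.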
     
  • The user namespace which creates a new network namespace owns that
    namespace and all resources created in it. This way we can target
    capability checks for privileged operations against network resources to
    the user_ns which created the network namespace in which the resource
    lives. Privilege to the user namespace which owns the network
    namespace, or any parent user namespace thereof, provides the same
    privilege to the network resource.

    This patch is reworked from a version originally by
    Serge E. Hallyn

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • The copy of copy_net_ns used when the network stack is not
    built is broken as it does not return -EINVAL when attempting
    to create a new network namespace. We don't even have
    a previous network namespace.

    Since we need a copy of copy_net_ns in net/net_namespace.h that is
    available when the networking stack is not built at all move the
    correct version of copy_net_ns from net_namespace.c into net_namespace.h,
    leaving us with just 2 versions of copy_net_ns: one version for when
    we compile in network namespace support and another stub for all other
    occasions.

    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

15 Nov, 2012

2 commits

  • Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data.

    The connection between a fuse filesystem and a fuse daemon is
    established when a fuse filesystem is mounted and provided with a file
    descriptor the fuse daemon created by opening /dev/fuse.

    For now restrict the communication of uids and gids between the fuse
    filesystem and the fuse daemon to the initial user namespace. Enforce
    this by verifying the file descriptor passed to the mount of fuse was
    opened in the initial user namespace. Ensuring the mount happens in
    the initial user namespace is not necessary as mounts from non-initial
    user namespaces are not yet allowed.

    In fuse_req_init_context convert the current fsuid and fsgid into the
    initial user namespace for the request that will be sent to the fuse
    daemon.

    In fuse_fill_attr convert the uid and gid passed from the fuse daemon
    from the initial user namespace into kuids and kgids.

    In iattr_to_fattr called from fuse_setattr convert kuids and kgids
    into the uids and gids in the initial user namespace before passing
    them to the fuse filesystem.

    In fuse_change_attributes_common called from fuse_dentry_revalidate,
    fuse_permission, fuse_getattr, fuse_setattr, and fuse_iget convert
    the uid and gid from the fuse daemon into a kuid and a kgid to store
    on the fuse inode.

    By default fuse mounts are restricted to tasks whose uid, suid, and
    euid match the fuse user_id and whose gid, sgid, and egid match
    the fuse group_id. Convert the user_id and group_id mount options
    into kuids and kgids at mount time, and use uid_eq and gid_eq to
    compare them in fuse_allow_task.

    Cc: Miklos Szeredi
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue.

    When creating directories and symlinks default the uid and gid of
    the mount requester to the global root uid and gid. autofs4_wait
    will update these fields when a mount is requested.

    When generating autofsv5 packets report the uid and gid of the mount
    requester in the user namespace of the process that opened the pipe,
    reporting unmapped uids and gids as overflowuid and overflowgid.

    In autofs_dev_ioctl_requester return the uid and gid of the last mount
    requester converted into the calling process's user namespace. When the
    uid or gid don't map return overflowuid and overflowgid as appropriate,
    allowing failure to find a mount requester to be distinguished from
    failure to map a mount requester.

    The uid and gid mount options specifying the user and group of the
    root autofs inode are converted into kuid and kgid as they are parsed
    defaulting to the current uid and current gid of the process that
    mounts autofs.

    Mounting of autofs for the present remains confined to processes in
    the initial user namespace.

    Cc: Ian Kent
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

29 Oct, 2012

7 commits

  • Linus Torvalds
     
  • Pull ktest confusion fix from Steven Rostedt:
    "With the v3.7-rc2 kernel, the network cards on my target boxes were
    not being brought up.

    I found that the modules for the network were not being installed.
    This was due to the config CONFIG_MODULES_USE_ELF_RELA that came
    before CONFIG_MODULES, and confused ktest in thinking that
    CONFIG_MODULES=y was not found.

    Ktest needs to test all configs and not just stop if something starts
    with CONFIG_MODULES."

    * tag 'ktest-v3.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
    ktest: Fix ktest confusion with CONFIG_MODULES_USE_ELF_RELA

    Linus Torvalds
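
The confusion described in this pull request comes from deciding on the first config line that merely starts with CONFIG_MODULES. ktest itself is a Perl script; the following is only a C model of the two scanning strategies, with modules_enabled_buggy and modules_enabled as hypothetical names, to show why every line has to be examined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PREFIX "CONFIG_MODULES"
#define WANTED "CONFIG_MODULES=y"

/* Buggy model: decide on the first line that merely starts with the prefix. */
static inline bool modules_enabled_buggy(const char *const *lines, size_t count)
{
    assert(lines != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (strncmp(lines[i], PREFIX, sizeof(PREFIX) - 1) == 0)
            return strcmp(lines[i], WANTED) == 0;
    }
    return false;
}

/* Fixed model: keep scanning every line until the exact setting is found. */
static inline bool modules_enabled(const char *const *lines, size_t count)
{
    assert(lines != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(lines[i], WANTED) == 0)
            return true;
    }
    return false;
}
```

With CONFIG_MODULES_USE_ELF_RELA=y listed before CONFIG_MODULES=y, the buggy model stops at the first line and reports modules as disabled, while the fixed model reaches the second line and reports them as enabled.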
     
  • Pull minor spi MXS fixes from Mark Brown:
    "These fixes are both pretty minor ones and are driver local."

    * tag 'spi-mxs' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc:
    spi: mxs: Terminate DMA in case of DMA timeout
    spi: mxs: Assign message status after transfer finished

    Linus Torvalds
     
  • Pull arm-soc fixes from Arnd Bergmann:
    "Bug fixes for a number of ARM platforms, mostly OMAP, imx and at91.

    These come a little later than I had hoped but unfortunately we had a
    few of these patches cause regressions themselves and had to work out
    how to deal with those in the meantime."

    * tag 'fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (38 commits)
    Revert "ARM i.MX25: Fix PWM per clock lookups"
    ARM: versatile: fix versatile_defconfig
    ARM: mvebu: update defconfig with 3.7 changes
    ARM: at91: fix at91x40 build
    ARM: socfpga: Fix socfpga compilation with early_printk() enabled
    ARM: SPEAr: Remove unused empty files
    MAINTAINERS: Add arm-soc tree entry
    ARM: dts: mxs: add the "clock-names" for gpmi-nand
    ARM: ux500: Correct SDI5 address and add some format changes
    ARM: ux500: Specify AMBA Primecell IDs for Nomadik I2C in DT
    ARM: ux500: Fix build error relating to IRQCHIP_SKIP_SET_WAKE
    ARM: at91: drop duplicated config SOC_AT91SAM9 entry
    ARM: at91/i2c: change id to let i2c-at91 work
    ARM: at91/i2c: change id to let i2c-gpio work
    ARM: at91/dts: at91sam9g20ek_common: Fix typos in buttons labels.
    ARM: at91: fix external interrupt specification in board code
    ARM: at91: fix external interrupts in non-DT case
    ARM: at91: at91sam9g10: fix SOC type detection
    ARM: at91/tc: fix typo in the DT document
    ARM: AM33XX: Fix configuration of dmtimer parent clock by dmtimer driver
    ...

    Linus Torvalds
     
  • Functions generic_file_splice_read and generic_file_splice_write access
    the pagecache directly. For block devices these functions must be locked
    so that block size is not changed while they are in progress.

    This patch is an additional fix for commit b87570f5d349 ("Fix a crash
    when block device is read and block size is changed at the same time")
    that locked aio_read, aio_write and mmap against block size change.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
    instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.

    This is an optimization. The RCU-protected region is very small, so
    there will be no latency problems if we disable preempt in this region.

    So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
    to preempt_disable / preempt_enable. It is smaller (and supposedly
    faster) than preemptible rcu_read_lock / rcu_read_unlock.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • This patch introduces new barrier pair light_mb() and heavy_mb() for
    percpu rw semaphores.

    This patch fixes a bug in percpu-rw-semaphores where a barrier was
    missing in percpu_up_write.

    This patch improves performance on the read path of
    percpu-rw-semaphores: on non-x86 cpus, there was a smp_mb() in
    percpu_up_read. This patch changes it to a compiler barrier and removes
    the "#if defined(X86) ..." condition.

    From: Lai Jiangshan
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

27 Oct, 2012

3 commits

  • This reverts commit 92063cee118655d25b50d04eb77b012f3287357a, it
    was applied prematurely, causing this build error for
    imx_v4_v5_defconfig:

    arch/arm/mach-imx/clk-imx25.c: In function 'mx25_clocks_init':
    arch/arm/mach-imx/clk-imx25.c:206:26: error: 'pwm_ipg_per' undeclared (first use in this function)
    arch/arm/mach-imx/clk-imx25.c:206:26: note: each undeclared identifier is reported only once for each function it appears in

    Sascha Hauer explains:
    > There are several gates missing in clk-imx25.c. I have a patch which
    > adds support for them and I seem to have missed that the above depends
    > on it.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • With the introduction of CONFIG_ARCH_MULTIPLATFORM, versatile is
    no longer the default platform, so we need to enable
    CONFIG_ARCH_VERSATILE explicitly in order for that to be selected
    rather than the multiplatform configuration.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • The split of 370 and XP into two Kconfig options and the multiplatform
    kernel support has changed a few Kconfig symbols, so let's update the
    mvebu_defconfig file with the latest changes.

    Signed-off-by: Thomas Petazzoni
    Signed-off-by: Arnd Bergmann

    Thomas Petazzoni