28 Feb, 2013

1 commit

  • Convert to the much saner new idr interface.

    The new interface doesn't directly translate to the way idr_pre_get()
    was used around ipc_addid() as preloading disables preemption. From
    my cursory reading, it seems like we should be able to do all
    allocation from ipc_addid(), so I moved it there. Can you please
    check whether this would be okay? If this is wrong and ipc_addid()
    should be allowed to be called from non-sleepable context, I'd suggest
    allocating id itself in the outer functions and later install the
    pointer using idr_replace().

    Signed-off-by: Tejun Heo
    Reported-by: Sedat Dilek
    Tested-by: Sedat Dilek
    Cc: Stanislav Kinsbursky
    Cc: "Eric W. Biederman"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

26 Feb, 2013

1 commit

  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhnacements to the user
    namespace. reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users that when you have the user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before it they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
     

24 Feb, 2013

2 commits

  • do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
    multiple, however there was no equivalent code in mm_populate(), which
    caused issues.

    This could be fixed by introduced the same rounding in mm_populate(),
    however I think it's preferable to make do_mmap_pgoff() return populate
    as a size rather than as a boolean, so we don't have to duplicate the
    size rounding logic in mm_populate().

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
    with MCL_FUTURE in effect), we want to populate the pages within the
    newly created vmas. This may take a while as we may have to read pages
    from disk, so ideally we want to do this outside of the write-locked
    mmap_sem region.

    This change introduces mm_populate(), which is used to defer populating
    such mappings until after the mmap_sem write lock has been released.
    This is implemented as a generalization of the former do_mlock_pages(),
    which accomplished the same task but was using during mlock() /
    mlockall().

    Signed-off-by: Michel Lespinasse
    Reported-by: Andy Lutomirski
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

23 Feb, 2013

2 commits

  • Allocating a file structure in function get_empty_filp() might fail because
    of several reasons:
    - not enough memory for file structures
    - operation is not allowed
    - user is over its limit

    Currently the function returns NULL in all cases and we loose the exact
    reason of the error. All callers of get_empty_filp() assume that the function
    can fail with ENFILE only.

    Return error through pointer. Change all callers to preserve this error code.

    [AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
    (things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

    Signed-off-by: Anatol Pomozov
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Al Viro

    Anatol Pomozov
     
  • Signed-off-by: Al Viro

    Al Viro
     

28 Jan, 2013

1 commit

  • This patch allow the unprivileged user to mount mqueuefs in
    user ns.

    If two userns share the same ipcns,the files in mqueue fs
    should be seen in both these two userns.

    If the userns has its own ipcns,it has its own mqueue fs too.
    ipcns has already done this job well.

    Signed-off-by: Gao feng
    Signed-off-by: Eric W. Biederman

    Gao feng
     

05 Jan, 2013

9 commits

  • Signed-off-by: Stanislav Kinsbursky
    Cc: "Eric W. Biederman"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     
  • Remove the redundant and confusing fill_copy(). Also add copy_msg()
    check for error. In this case exit from the function have to be done
    instead of break, because further code interprets any error as EAGAIN.

    Also define copy_msg() for the case when CONFIG_CHECKPOINT_RESTORE is
    disabled.

    Signed-off-by: Stanislav Kinsbursky
    Cc: "Eric W. Biederman"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     
  • This code works if CONFIG_CHECKPOINT_RESTORE is disabled.

    [akpm@linux-foundation.org: remove __maybe_unused]
    Signed-off-by: Stanislav Kinsbursky
    Cc: "Eric W. Biederman"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     
  • Passing and checking of msgflg to free_copy() is redundant. This patch
    sets copy to NULL on declaration instead and checks for non-NULL in
    free_copy().

    Note: in case of copy allocation failure, error is returned immediately.
    So no need to check for IS_ERR() in free_copy().

    Signed-off-by: Stanislav Kinsbursky
    Cc: "Eric W. Biederman"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     
  • This test can be used to check wheither kernel supports IPC message queue
    copy and restore features (required by CRIU project).

    Signed-off-by: Stanislav Kinsbursky
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     
  • This patch is required for checkpoint/restore in userspace.

    c/r requires some way to get all pending IPC messages without deleting
    them from the queue (checkpoint can fail and in this case tasks will be
    resumed, so queue have to be valid).

    To achive this, new operation flag MSG_COPY for sys_msgrcv() system call
    was introduced. If this flag was specified, then mtype is interpreted as
    number of the message to copy.

    If MSG_COPY is set, then kernel will allocate dummy message with passed
    size, and then use new copy_msg() helper function to copy desired message
    (instead of unlinking it from the queue).

    Notes:

    1) Return -ENOSYS if MSG_COPY is specified, but
    CONFIG_CHECKPOINT_RESTORE is not set.

    Signed-off-by: Stanislav Kinsbursky
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     
  • Move all message related manipulation into one function msg_fill().
    Actually, two functions because of the compat one.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Stanislav Kinsbursky
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     
  • Add 3 new variables and sysctls to tune them (by one "next_id" variable
    for messages, semaphores and shared memory respectively). This variable
    can be used to set desired id for next allocated IPC object. By default
    it's equal to -1 and old behaviour is preserved. If this variable is
    non-negative, then desired idr will be extracted from it and used as a
    start value to search for free IDR slot.

    Notes:

    1) this patch doesn't guarantee that the new object will have desired
    id. So it's up to user space how to handle new object with wrong id.

    2) After a sucessful id allocation attempt, "next_id" will be set back
    to -1 (if it was non-negative).

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Stanislav Kinsbursky
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     
  • This is a cleanup patch. The assignment is redundant.

    Signed-off-by: Stanislav Kinsbursky
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     

19 Dec, 2012

1 commit

  • Pull (again) user namespace infrastructure changes from Eric Biederman:
    "Those bugs, those darn embarrasing bugs just want don't want to get
    fixed.

    Linus I just updated my mirror of your kernel.org tree and it appears
    you successfully pulled everything except the last 4 commits that fix
    those embarrasing bugs.

    When you get a chance can you please repull my branch"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Fix typo in description of the limitation of userns_install
    userns: Add a more complete capability subset test to commit_creds
    userns: Require CAP_SYS_ADMIN for most uses of setns.
    Fix cap_capable to only allow owners in the parent user namespace to have caps.

    Linus Torvalds
     

18 Dec, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc//ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_mayaccess permission
    checks are always applied.

    The files under /proc//ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through the David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were commited here adn
    in David Miller's -net tree so that I could complete the work on the
    /proc//ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing that user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     

15 Dec, 2012

1 commit

  • Andy Lutomirski found a nasty little bug in
    the permissions of setns. With unprivileged user namespaces it
    became possible to create new namespaces without privilege.

    However the setns calls were relaxed to only require CAP_SYS_ADMIN in
    the user nameapce of the targed namespace.

    Which made the following nasty sequence possible.

    pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
    if (pid == 0) { /* child */
    system("mount --bind /home/me/passwd /etc/passwd");
    }
    else if (pid != 0) { /* parent */
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
    fd = open(path, O_RDONLY);
    setns(fd, 0);
    system("su -");
    }

    Prevent this possibility by requiring CAP_SYS_ADMIN
    in the current user namespace when joing all but the user namespace.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

12 Dec, 2012

1 commit

  • There was some desire in large applications using MAP_HUGETLB or
    SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
    others. This is useful together with NUMA policy: use 2MB interleaving
    on some mappings, but 1GB on local mappings.

    This patch extends the IPC/SHM syscall interfaces slightly to allow
    specifying the page size.

    It borrows some upper bits in the existing flag arguments and allows
    encoding the log of the desired page size in addition to the *_HUGETLB
    flag. When 0 is specified the default size is used, this makes the
    change fully compatible.

    Extending the internal hugetlb code to handle this is straight forward.
    Instead of a single mount it just keeps an array of them and selects the
    right mount based on the specified page size. When no page size is
    specified it uses the mount of the default page size.

    The change is not visible in /proc/mounts because internal mounts don't
    appear there. It also has very little overhead: the additional mounts
    just consume a super block, but not more memory when not used.

    I also exported the new flags to the user headers (they were previously
    under __KERNEL__). Right now only symbols for x86 and some other
    architecture for 1GB and 2MB are defined. The interface should already
    work for all other architectures though. Only architectures that define
    multiple hugetlb sizes actually need it (that is currently x86, tile,
    powerpc). However tile and powerpc have user configurable hugetlb
    sizes, so it's not easy to add defines. A program on those
    architectures would need to query sysfs and use the appropiate log2.

    [akpm@linux-foundation.org: cleanups]
    [rientjes@google.com: fix build]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Michael Kerrisk
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

20 Nov, 2012

3 commits

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowd by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Modify create_new_namespaces to explicitly take a user namespace
    parameter, instead of implicitly through the task_struct.

    This allows an implementation of unshare(CLONE_NEWUSER) where
    the new user namespace is not stored onto the current task_struct
    until after all of the namespaces are created.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • - Push the permission check from the core setns syscall into
    the setns install methods where the user namespace of the
    target namespace can be determined, and used in a ns_capable
    call.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

13 Oct, 2012

2 commits

  • Keep a pointer to the audit_names "slot" in struct filename.

    Have all of the audit_inode callers pass a struct filename ponter to
    audit_inode instead of a string pointer. If the aname field is already
    populated, then we can skip walking the list altogether and just use it
    directly.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • getname() is intended to copy pathname strings from userspace into a
    kernel buffer. The result is just a string in kernel space. It would
    however be quite helpful to be able to attach some ancillary info to
    the string.

    For instance, we could attach some audit-related info to reduce the
    amount of audit-related processing needed. When auditing is enabled,
    we could also call getname() on the string more than once and not
    need to recopy it from userspace.

    This patchset converts the getname()/putname() interfaces to return
    a struct instead of a string. For now, the struct just tracks the
    string in kernel space and the original userland pointer for it.

    Later, we'll add other information to the struct as it becomes
    convenient.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

12 Oct, 2012

1 commit

  • Currently, this gets set mostly by happenstance when we call into
    audit_inode_child. While that might be a little more efficient, it seems
    wrong. If the syscall ends up failing before audit_inode_child ever gets
    called, then you'll have an audit_names record that shows the full path
    but has the parent inode info attached.

    Fix this by passing in a parent flag when we call audit_inode that gets
    set to the value of LOOKUP_PARENT. We can then fix up the pathname for
    the audit entry correctly from the get-go.

    While we're at it, clean up the no-op macro for audit_inode in the
    !CONFIG_AUDITSYSCALL case.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

09 Oct, 2012

1 commit

  • Commit d6629859b36d ("ipc/mqueue: improve performance of send/recv") and
    ce2d52cc ("ipc/mqueue: add rbtree node caching support") introduced an
    rbtree of message priorities, and usage of rb_init_node() to initialize
    the corresponding nodes. As it turns out, rb_init_node() is unnecessary
    here, as the nodes are fully initialized on insertion by rb_link_node()
    and the code doesn't access nodes that aren't inserted on the rbtree.

    Removing the rb_init_node() calls as I removed that function during
    rbtree API cleanups (the only other use of it was in a place that
    similarly didn't require it).

    Signed-off-by: Michel Lespinasse
    Acked-by: Doug Ledford
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

03 Oct, 2012

2 commits

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values are far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting compile type incompatible compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privlige checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user names to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/git logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     

27 Sep, 2012

2 commits


07 Sep, 2012

1 commit

  • - Store the ipc owner and creator with a kuid
    - Store the ipc group and the crators group with a kgid.
    - Add error handling to ipc_update_perms, allowing it to
    fail if the uids and gids can not be converted to kuids
    or kgids.
    - Modify the proc files to display the ipc creator and
    owner in the user namespace of the opener of the proc file.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

19 Aug, 2012

1 commit


31 Jul, 2012

4 commits

  • Rather than #define the options manually in the architecture code, add
    Kconfig options for them and select them there instead. This also allows
    us to select the compat IPC version parsing automatically for platforms
    using the old compat IPC interface.

    Reported-by: Andrew Morton
    Signed-off-by: Will Deacon
    Cc: Arnd Bergmann
    Cc: Chris Metcalf
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • The msgsnd and msgrcv system calls use size_t to represent the size of the
    message being transferred. POSIX states that values of msgsz greater than
    SSIZE_MAX cause the result to be implementation-defined. On Linux, this
    equates to returning -EINVAL if (long) msgsz < 0.

    For compat tasks where !CONFIG_ARCH_WANT_OLD_COMPAT_IPC and compat_size_t
    is smaller than size_t, negative size values passed from userspace will be
    interpreted as positive values by do_msg{rcv,snd} and will fail to exit
    early with -EINVAL.

    This patch changes the compat prototypes for msg{rcv,snd} so that the
    message size is represented as a compat_ssize_t, which we cast to the
    native ssize_t type for the core IPC code.

    Cc: Arnd Bergmann
    Acked-by: Chris Metcalf
    Acked-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • Commit 48b25c43e6ee ("ipc: provide generic compat versions of IPC
    syscalls") added a new ARCH_WANT_OLD_COMPAT_IPC config option for
    architectures to select if their compat target requires the old IPC
    syscall interface.

    For architectures (such as AArch64) that do not require the internal
    calling conventions provided by this option, but have a compat target
    where the C library passes the IPC_64 flag explicitly,
    compat_ipc_parse_version no longer strips out the flag before calling
    the native system call implementation, resulting in unknown SHM/IPC
    commands and -EINVAL being returned to userspace.

    This patch separates the selection of the internal calling conventions
    for the IPC syscalls from the version parsing, allowing architectures to
    select __ARCH_WANT_COMPAT_IPC_PARSE_VERSION if they want to use version
    parsing whilst retaining the newer syscall calling conventions.

    Acked-by: Chris Metcalf
    Cc: Arnd Bergmann
    Acked-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • If the SHMLBA definition for a native task differs from the definition for
    a compat task, the do_shmat() function would need to handle both.

    This patch introduces COMPAT_SHMLBA, which is used by the compat shmat
    syscall when calling the ipc code and allows architectures such as AArch64
    (where the native SHMLBA is 64k but the compat (AArch32) definition is
    16k) to provide the correct semantics for compat IPC system calls.

    Cc: David S. Miller
    Cc: Chris Zankel
    Cc: Arnd Bergmann
    Acked-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     

23 Jul, 2012

1 commit


14 Jul, 2012

1 commit