24 Feb, 2021

1 commit

  • Pull idmapped mounts from Christian Brauner:
    "This introduces idmapped mounts, which have been in the making for some
    time. Simply put, different mounts can expose the same file or
    directory with different ownership. This initial implementation comes
    with ports for fat, ext4 and with Christoph's port for xfs with more
    filesystems being actively worked on by independent people and
    maintainers.

    Idmapped mounts handle a wide range of long-standing use-cases. Here
    are just a few:

    - Idmapped mounts make it possible to easily share files between
    multiple users or multiple machines especially in complex
    scenarios. For example, idmapped mounts will be used in the
    implementation of portable home directories in
    systemd-homed.service(8) where they allow users to move their home
    directory to an external storage device and use it on multiple
    computers where they are assigned different uids and gids. This
    effectively makes it possible to assign random uids and gids at
    login time.

    - It is possible to share files from the host with unprivileged
    containers without having to change ownership permanently through
    chown(2).

    - It is possible to idmap a container's rootfs without having to
    mangle every file. For example, Chromebooks use it to share the
    user's Download folder with their unprivileged containers in their
    Linux subsystem.

    - It is possible to share files between containers with
    non-overlapping idmappings.

    - Filesystems that lack a proper concept of ownership such as fat can
    use idmapped mounts to implement discretionary access (DAC)
    permission checking.

    - They allow users to efficiently change ownership on a per-mount
    basis without having to (recursively) chown(2) all files. In
    contrast to chown(2), changing ownership of large sets of files is
    instantaneous with idmapped mounts. This is especially useful when
    ownership of a whole root filesystem of a virtual machine or
    container is changed. With idmapped mounts a single
    mount_setattr() syscall will be sufficient to change the ownership of
    all files.

    - Idmapped mounts always take the current ownership into account as
    idmappings specify what a given uid or gid is supposed to be mapped
    to. This contrasts with the chown(2) syscall which cannot by itself
    take the current ownership of the files it changes into account. It
    simply changes the ownership to the specified uid and gid. This is
    especially problematic when recursively chown(2)ing a large set of
    files, which is common in the aforementioned portable home
    directory, container and VM scenarios.

    - Idmapped mounts allow ownership to be changed locally, restricting it
    to specific mounts, and temporarily as the ownership changes only
    apply as long as the mount exists.

    Several userspace projects have either already put up patches and
    pull-requests for this feature or will do so should you decide to pull
    this:

    - systemd: In a wide variety of scenarios but especially right away
    in their implementation of portable home directories.

    https://systemd.io/HOME_DIRECTORY/

    - container runtimes: containerd, runC, LXD: To share data between
    host and unprivileged containers, unprivileged and privileged
    containers, etc. The pull request for idmapped mounts support in
    containerd, the default Kubernetes runtime, has been up for quite
    a while now: https://github.com/containerd/containerd/pull/4734

    - The virtio-fs developers and several users have expressed interest
    in using this feature with virtual machines once virtio-fs is
    ported.

    - ChromeOS: Sharing host-directories with unprivileged containers.

    I've tightly synced with all those projects and all of those listed
    here have also expressed their need/desire for this feature on the
    mailing list. For more info on how people use this there's a bunch of
    talks about this too. Here's just two recent ones:

    https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
    https://fosdem.org/2021/schedule/event/containers_idmap/

    This comes with an extensive xfstests suite covering both ext4 and
    xfs:

    https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

    It covers truncation, creation, opening, xattrs, vfscaps, setid
    execution, setgid inheritance and more both with idmapped and
    non-idmapped mounts. It already helped to discover an unrelated xfs
    setgid inheritance bug which has since been fixed in mainline. It will
    be sent for inclusion with the xfstests project should you decide to
    merge this.

    In order to support per-mount idmappings vfsmounts are marked with
    user namespaces. The idmapping of the user namespace will be used to
    map the ids of vfs objects when they are accessed through that mount.
    By default all vfsmounts are marked with the initial user namespace.
    The initial user namespace is used to indicate that a mount is not
    idmapped. All operations behave as before and this is verified in the
    testsuite.
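
    As a rough illustration of what "mapping the ids of vfs objects"
    means, the following userspace sketch translates an on-disk id into
    the id seen through a mount. It is not the kernel implementation
    (which reuses the user namespace helpers); struct id_extent,
    map_id_through_mount() and the overflow id 65534 for unmapped ids are
    assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for one extent of an idmapping. */
struct id_extent {
	uint32_t on_disk_first; /* first id as stored by the filesystem */
	uint32_t mount_first;   /* first id as seen through the mount */
	uint32_t count;         /* number of consecutive ids covered */
};

/* Assumed id reported for ids that have no mapping ("nobody"). */
#define SKETCH_OVERFLOW_ID 65534u

/* Translate an on-disk id into the id visible through the mount. */
static uint32_t map_id_through_mount(const struct id_extent *e,
				     uint32_t disk_id)
{
	if (disk_id >= e->on_disk_first &&
	    disk_id - e->on_disk_first < e->count)
		return e->mount_first + (disk_id - e->on_disk_first);
	return SKETCH_OVERFLOW_ID;
}
```

    With an extent of { 1000, 1001, 1 }, a file owned by uid 1000 on disk
    would show up as owned by uid 1001 through the mount, while every
    other owner falls back to the overflow id.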

    Based on prior discussions we want to attach the whole user namespace
    and not just a dedicated idmapping struct. This allows us to reuse all
    the helpers that already exist for dealing with idmappings instead of
    introducing a whole new range of helpers. In addition, if we decide in
    the future that we are confident enough to enable unprivileged users
    to setup idmapped mounts the permission checking can take into account
    whether the caller is privileged in the user namespace the mount is
    currently marked with.

    The user namespace the mount will be marked with can be specified by
    passing a file descriptor referring to the user namespace as an
    argument to the new mount_setattr() syscall together with the new
    MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
    of extensibility.

    The following conditions must be met in order to create an idmapped
    mount:

    - The caller must currently have the CAP_SYS_ADMIN capability in the
    user namespace the underlying filesystem has been mounted in.

    - The underlying filesystem must support idmapped mounts.

    - The mount must not already be idmapped. This also implies that the
    idmapping of a mount cannot be altered once it has been idmapped.

    - The mount must be a detached/anonymous mount, i.e. it must have
    been created by calling open_tree() with the OPEN_TREE_CLONE flag
    and it must not already have been visible in the filesystem.

    The last two points guarantee easier semantics for userspace and the
    kernel and make the implementation significantly simpler.
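
    The sequence described above (detached clone, then idmap, then
    attach) could look roughly like the sketch below. It issues the raw
    syscalls via syscall(2). The fallback constants are copied from the
    UAPI headers as far as I recall them and should be verified; the
    helper name idmapped_bind() and struct sketch_mount_attr (a local
    mirror of the initial struct mount_attr layout) are made up for this
    example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers. The syscall numbers are those of x86-64
 * and the generic syscall table; other architectures may differ. */
#ifndef SYS_open_tree
#define SYS_open_tree 428
#endif
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef SYS_mount_setattr
#define SYS_mount_setattr 442
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOUNT_ATTR_IDMAP
#define MOUNT_ATTR_IDMAP 0x00100000
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/* Local mirror of the first version of struct mount_attr. */
struct sketch_mount_attr {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
	uint64_t userns_fd;
};

/* Clone `source` as a detached mount, idmap it with the user namespace
 * referred to by `userns_fd`, and attach it at `target`.
 * Returns 0 on success, -1 on failure with errno set. */
static int idmapped_bind(const char *source, const char *target,
			 int userns_fd)
{
	struct sketch_mount_attr attr = {
		.attr_set = MOUNT_ATTR_IDMAP,
		.userns_fd = (uint64_t)userns_fd,
	};
	int tree_fd, saved;

	tree_fd = (int)syscall(SYS_open_tree, AT_FDCWD, source,
			       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (tree_fd < 0)
		return -1;

	if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) < 0 ||
	    syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, target,
		    MOVE_MOUNT_F_EMPTY_PATH) < 0) {
		saved = errno;
		close(tree_fd);
		errno = saved;
		return -1;
	}
	close(tree_fd);
	return 0;
}
```

    A caller would typically obtain userns_fd by opening
    /proc/<pid>/ns/user of a process living in the desired user
    namespace; without the required privilege the first syscall already
    fails.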

    By default vfsmounts are marked with the initial user namespace and no
    behavioral or performance changes are observed.

    The manpage with a detailed description can be found here:

    https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8

    In order to support idmapped mounts, filesystems need to be changed
    and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
    patches to convert individual filesystems are not very large or
    complicated overall as can be seen from the included fat, ext4, and
    xfs ports. Patches for other filesystems are actively worked on and
    will be sent out separately. The xfstests suite can be used to verify
    that a port has been done correctly.

    The mount_setattr() syscall is motivated independently of the idmapped
    mounts patches and it's been around since July 2019. One of the most
    valuable features of the new mount api is the ability to perform
    mounts based on file descriptors only.

    Together with the lookup restrictions available in the openat2()
    RESOLVE_* flag namespace which we added in v5.6 this is the first time
    we are close to hardened and race-free (e.g. symlinks) mounting and
    path resolution.

    While userspace has started porting to the new mount api to mount
    proper filesystems and create new bind-mounts it is currently not
    possible to change mount options of an already existing bind mount in
    the new mount api since the mount_setattr() syscall is missing.

    With the addition of the mount_setattr() syscall we remove this last
    restriction and userspace can now fully port to the new mount api,
    covering every use-case the old mount api could. We also add the
    crucial ability to recursively change mount options for a whole mount
    tree, both removing and adding mount options at the same time. This
    syscall has been requested multiple times by various people and
    projects.

    There is a simple tool available at

    https://github.com/brauner/mount-idmapped

    that allows creating idmapped mounts so people can play with this
    patch series. I'll add support for the regular mount binary in the
    following weeks should you decide to pull this.

    Here's an example of a simple idmapped mount of another user's home
    directory:

    u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

    u1001@f2-vm:/$ ls -al /home/ubuntu/
    total 28
    drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
    drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
    -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
    -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ ls -al /mnt/
    total 28
    drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
    drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
    -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
    -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ touch /mnt/my-file

    u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

    u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

    u1001@f2-vm:/$ ls -al /mnt/my-file
    -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

    u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
    -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

    u1001@f2-vm:/$ getfacl /mnt/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: mnt/my-file
    # owner: u1001
    # group: u1001
    user::rw-
    user:u1001:rwx
    group::rw-
    mask::rwx
    other::r--

    u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: home/ubuntu/my-file
    # owner: ubuntu
    # group: ubuntu
    user::rw-
    user:ubuntu:rwx
    group::rw-
    mask::rwx
    other::r--"

    * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
    xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
    xfs: support idmapped mounts
    ext4: support idmapped mounts
    fat: handle idmapped mounts
    tests: add mount_setattr() selftests
    fs: introduce MOUNT_ATTR_IDMAP
    fs: add mount_setattr()
    fs: add attr_flags_to_mnt_flags helper
    fs: split out functions to hold writers
    namespace: only take read lock in do_reconfigure_mnt()
    mount: make {lock,unlock}_mount_hash() static
    namespace: take lock_mount_hash() directly when changing flags
    nfs: do not export idmapped mounts
    overlayfs: do not mount on top of idmapped mounts
    ecryptfs: do not mount on top of idmapped mounts
    ima: handle idmapped mounts
    apparmor: handle idmapped mounts
    fs: make helpers idmap mount aware
    exec: handle idmapped mounts
    would_dump: handle idmapped mounts
    ...

    Linus Torvalds
     

03 Feb, 2021

1 commit

  • syzbot found WARNINGs in several smackfs write operations where
    the byte count passed to memdup_user_nul() exceeds
    GFP MAX_ORDER. Reject the write if count is bigger than PAGE_SIZE.

    Per smackfs doc, smk_write_net4addr accepts any label or -CIPSO,
    smk_write_net6addr accepts any label or -DELETE. I couldn't find
    any general rule for other label lengths except SMK_LABELLEN,
    SMK_LONGLABEL, SMK_CIPSOMAX which are documented.

    Let's constrain, in general, smackfs label lengths to PAGE_SIZE,
    although the fuzzer only crashes on a write to smackfs/netlabel with
    a length of 0x400000.

    Here is a quick way to reproduce the WARNING:
    python -c "print('A' * 0x400000)" > /sys/fs/smackfs/netlabel
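
    The kind of check described above can be sketched in plain C as
    follows. This is only an illustration, not the patch itself: the
    helper name, the 4 KiB page size and the choice of -EINVAL as the
    error code are assumptions.

```c
#include <errno.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; the kernel uses the real PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/* Reject oversized writes before any buffer is allocated for them. */
static int smk_sketch_check_count(size_t count)
{
	if (count > SKETCH_PAGE_SIZE)
		return -EINVAL;
	return 0;
}
```

    A write handler would call such a check first and only then hand
    count to memdup_user_nul().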

    Reported-by: syzbot+a71a442385a0b2815497@syzkaller.appspotmail.com
    Signed-off-by: Sabyrzhan Tasbolatov
    Signed-off-by: Casey Schaufler

    Sabyrzhan Tasbolatov
     

24 Jan, 2021

2 commits

  • When interacting with user namespace and non-user namespace aware
    filesystem capabilities the vfs will perform various security checks to
    determine whether or not the filesystem capabilities can be used by the
    caller, whether they need to be removed and so on. The main
    infrastructure for this resides in the capability codepaths but they are
    called through the LSM security infrastructure even though they are not
    technically an LSM or optional. This extends the existing security hooks
    security_inode_removexattr(), security_inode_killpriv(),
    security_inode_getsecurity() to pass down the mount's user namespace and
    makes them aware of idmapped mounts.

    In order to actually get filesystem capabilities from disk the
    capability infrastructure exposes the get_vfs_caps_from_disk() helper.
    For user namespace aware filesystem capabilities a root uid is stored
    alongside the capabilities.

    In order to determine whether the caller can make use of the filesystem
    capability or whether it needs to be ignored it is translated according
    to the superblock's user namespace. If it can be translated to uid 0
    according to that id mapping the caller can use the filesystem
    capabilities stored on disk. If we are accessing the inode that holds
    the filesystem capabilities through an idmapped mount we map the root
    uid according to the mount's user namespace. Afterwards the checks are
    identical to non-idmapped mounts: reading filesystem caps from disk
    enforces that the root uid associated with the filesystem capability
    must have a mapping in the superblock's user namespace and that the
    caller is either in the same user namespace or is a descendant of the
    superblock's user namespace. For filesystems that are mountable inside
    user namespace the caller can just mount the filesystem and won't
    usually need to idmap it. If they do want to idmap it they can create an
    idmapped mount and mark it with a user namespace they created and which
    is thus a descendant of s_user_ns. For filesystems that are not
    mountable inside user namespaces the descendant rule is trivially true
    because the s_user_ns will be the initial user namespace.

    If the initial user namespace is passed nothing changes so non-idmapped
    mounts will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Acked-by: James Morris
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • When interacting with extended attributes the vfs verifies that the
    caller is privileged over the inode with which the extended attribute is
    associated. For posix access and posix default extended attributes a uid
    or gid can be stored on-disk. Let the functions handle posix extended
    attributes on idmapped mounts. If the inode is accessed through an
    idmapped mount we need to map it according to the mount's user
    namespace. Afterwards the checks are identical to non-idmapped mounts.
    This has no effect for e.g. security xattrs since they don't store uids
    or gids and don't perform permission checks on them like posix acls do.

    Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Signed-off-by: Tycho Andersen
    Signed-off-by: Christian Brauner

    Tycho Andersen
     

25 Dec, 2020

1 commit

  • Pull smack fix from Casey Schaufler:
    "Provide a fix for the incorrect handling of privilege in the face of
    io_uring's use of kernel threads. That invalidated a long-standing
    assumption regarding the privilege of kernel threads.

    The fix is simple and safe. It was provided by Jens Axboe and has been
    tested"

    * tag 'Smack-for-5.11-io_uring-fix' of git://github.com/cschaufler/smack-next:
    Smack: Handle io_uring kernel thread privileges

    Linus Torvalds
     

23 Dec, 2020

1 commit

  • Smack assumes that kernel threads are privileged for smackfs
    operations. This was necessary because the credential of the
    kernel thread was not related to a user operation. With io_uring
    the credential does reflect a user's rights and can be used.
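
    One way to express "a kernel thread that is not an io_uring worker"
    is a flag test like the one below. This is a userspace sketch; the
    flag values are what I recall from include/linux/sched.h of that
    era and should be verified, and is_plain_kernel_thread() is a
    hypothetical name.

```c
#include <stdbool.h>

/* Recalled task flag values; treat them as assumptions. */
#define PF_IO_WORKER 0x00000010u
#define PF_KTHREAD   0x00200000u

/* True only for kernel threads that do not act on behalf of a user
 * through io_uring; only those keep the implicit privilege. */
static bool is_plain_kernel_thread(unsigned int task_flags)
{
	return (task_flags & (PF_KTHREAD | PF_IO_WORKER)) == PF_KTHREAD;
}
```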

    Suggested-by: Jens Axboe
    Acked-by: Jens Axboe
    Acked-by: Eric W. Biederman
    Signed-off-by: Casey Schaufler

    Casey Schaufler
     

17 Dec, 2020

1 commit


04 Dec, 2020

1 commit

  • A followup change to tcp_request_sock_op would have to drop the 'const'
    qualifier from the 'route_req' function as the
    'security_inet_conn_request' call is moved there - and that function
    expects a 'struct sock *'.

    However, it turns out it's also possible to add a const qualifier to
    security_inet_conn_request instead.

    Signed-off-by: Florian Westphal
    Acked-by: James Morris
    Signed-off-by: Jakub Kicinski

    Florian Westphal
     

17 Nov, 2020

1 commit

  • This variable isn't used and can be removed to avoid a gcc warning:
    security/smack/smack_lsm.c:3873:6: warning: variable ‘rc’ set but not
    used [-Wunused-but-set-variable]

    Signed-off-by: Alex Shi
    Cc: Casey Schaufler
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: linux-security-module@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Casey Schaufler

    Alex Shi
     

14 Nov, 2020

1 commit

  • There are some kernel-doc interface issues:
    security/smack/smackfs.c:1950: warning: Function parameter or member
    'list' not described in 'smk_parse_label_list'
    security/smack/smackfs.c:1950: warning: Excess function parameter
    'private' description in 'smk_parse_label_list'
    security/smack/smackfs.c:1979: warning: Function parameter or member
    'list' not described in 'smk_destroy_label_list'
    security/smack/smackfs.c:1979: warning: Excess function parameter 'head'
    description in 'smk_destroy_label_list'
    security/smack/smackfs.c:2141: warning: Function parameter or member
    'count' not described in 'smk_read_logging'
    security/smack/smackfs.c:2141: warning: Excess function parameter 'cn'
    description in 'smk_read_logging'
    security/smack/smackfs.c:2278: warning: Function parameter or member
    'format' not described in 'smk_user_access'

    Correct them in this patch.

    Signed-off-by: Alex Shi
    Cc: Casey Schaufler
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: linux-security-module@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Casey Schaufler

    Alex Shi
     

14 Oct, 2020

1 commit

  • Pull smack updates from Casey Schaufler:
    "Two minor fixes and one performance enhancement to Smack. The
    performance improvement is significant and the new code is more like
    its counterpart in SELinux.

    - Two kernel test robot suggested clean-ups.

    - Teach Smack to use the IPv4 netlabel cache. This results in a
    12-14% improvement on TCP benchmarks"

    * tag 'Smack-for-5.10' of git://github.com/cschaufler/smack-next:
    Smack: Remove unnecessary variable initialization
    Smack: Fix build when NETWORK_SECMARK is not set
    Smack: Use the netlabel cache
    Smack: Set socket labels only once
    Smack: Consolidate uses of secmark into a function

    Linus Torvalds
     

06 Oct, 2020

1 commit


23 Sep, 2020

1 commit


12 Sep, 2020

3 commits


24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where applicable.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
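
    For readers unfamiliar with the pseudo-keyword, here is a small
    userspace sketch. The macro mirrors how the kernel defines
    fallthrough on compilers that support the attribute; the fallback
    branch and the function permissions_for_level() are made up for the
    example.

```c
/* Userspace stand-in for the kernel's pseudo-keyword. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0) /* fallthrough */
#endif

/* Higher levels deliberately accumulate the bits of the lower ones. */
static int permissions_for_level(int level)
{
	int perms = 0;

	switch (level) {
	case 2:
		perms |= 4;
		fallthrough;
	case 1:
		perms |= 2;
		fallthrough;
	case 0:
		perms |= 1;
		break;
	default:
		break;
	}
	return perms;
}
```

    Unlike a comment, the statement form lets the compiler verify that
    every intentional fall-through is marked.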

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

28 Jul, 2020

2 commits

  • We have an upper bound on "maplevel" but forgot to check for negative
    values.
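
    A two-sided bounds check of this kind can be sketched in userspace
    as follows. The helper name and the upper bound of 255 are
    assumptions for the example; the kernel code uses its own constant
    and parsing context.

```c
#include <errno.h>
#include <stdio.h>

/* Assumed upper bound for the sketch. */
#define SKETCH_CIPSO_MAXLEVEL 255

/* Parse a level from text and reject values outside [0, max]. */
static int parse_maplevel(const char *rule, int *maplevel_out)
{
	int maplevel;

	if (sscanf(rule, "%d", &maplevel) != 1)
		return -EINVAL;
	if (maplevel < 0 || maplevel > SKETCH_CIPSO_MAXLEVEL)
		return -EINVAL;

	*maplevel_out = maplevel;
	return 0;
}
```

    Because "%d" accepts a leading minus sign, checking only the upper
    bound would let negative levels through.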

    Fixes: e114e473771c ("Smack: Simplified Mandatory Access Control Kernel")
    Signed-off-by: Dan Carpenter
    Signed-off-by: Casey Schaufler

    Dan Carpenter
     
  • This is similar to commit 84e99e58e8d1 ("Smack: slab-out-of-bounds in
    vsscanf") where we added a bounds check on "rule".

    Reported-by: syzbot+a22c6092d003d6fe1122@syzkaller.appspotmail.com
    Fixes: f7112e6c9abf ("Smack: allow for significantly longer Smack labels v4")
    Signed-off-by: Dan Carpenter
    Signed-off-by: Casey Schaufler

    Dan Carpenter
     

15 Jul, 2020

1 commit

  • smk_write_relabel_self() frees memory from the task's credentials with
    no locking, which can easily cause a use-after-free because multiple
    tasks can share the same credentials structure.

    Fix this by using prepare_creds() and commit_creds() to correctly modify
    the task's credentials.

    Reproducer for "BUG: KASAN: use-after-free in smk_write_relabel_self":

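    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    /* Both threads write to relabel-self in a tight loop and race on the shared creds. */
    static void *thrproc(void *arg)
    {
        int fd = open("/sys/fs/smackfs/relabel-self", O_WRONLY);
        for (;;) write(fd, "foo", 3);
    }

    int main()
    {
        pthread_t t;
        pthread_create(&t, NULL, thrproc, NULL);
        thrproc(NULL);
    }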

    Reported-by: syzbot+e6416dabb497a650da40@syzkaller.appspotmail.com
    Fixes: 38416e53936e ("Smack: limited capability for changing process label")
    Cc: # v4.4+
    Signed-off-by: Eric Biggers
    Signed-off-by: Casey Schaufler

    Eric Biggers
     

14 Jun, 2020

1 commit

  • …git/dhowells/linux-fs

    Pull notification queue from David Howells:
    "This adds a general notification queue concept and adds an event
    source for keys/keyrings, such as linking and unlinking keys and
    changing their attributes.

    Thanks to Debarshi Ray, we do have a pull request to use this to fix a
    problem with gnome-online-accounts - as mentioned last time:

    https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

    Without this, g-o-a has to constantly poll a keyring-based kerberos
    cache to find out if kinit has changed anything.

    [ There are other notifications pending: mount/sb fsinfo notifications
    for libmount that Karel Zak and Ian Kent have been working on, and
    Christian Brauner would like to use them in lxc, but let's see how
    this one works first ]

    LSM hooks are included:

    - A set of hooks are provided that allow an LSM to rule on whether or
    not a watch may be set. Each of these hooks takes a different
    "watched object" parameter, so they're not really shareable. The
    LSM should use current's credentials. [Wanted by SELinux & Smack]

    - A hook is provided to allow an LSM to rule on whether or not a
    particular message may be posted to a particular queue. This is
    given the credentials from the event generator (which may be the
    system) and the watch setter. [Wanted by Smack]

    I've provided SELinux and Smack with implementations of some of these
    hooks.

    WHY
    ===

    Key/keyring notifications are desirable because if you have your
    kerberos tickets in a file/directory, your Gnome desktop will monitor
    that using something like fanotify and tell you if your credentials
    cache changes.

    However, we also have the ability to cache your kerberos tickets in
    the session, user or persistent keyring so that it isn't left around
    on disk across a reboot or logout. Keyrings, however, cannot currently
    be monitored asynchronously, so the desktop has to poll for it - not
    so good on a laptop. This facility will allow the desktop to avoid the
    need to poll.

    DESIGN DECISIONS
    ================

    - The notification queue is built on top of a standard pipe. Messages
    are effectively spliced in. The pipe is opened with a special flag:

    pipe2(fds, O_NOTIFICATION_PIPE);

    The special flag has the same value as O_EXCL (which doesn't seem
    like it will ever be applicable in this context)[?]. It is given up
    front to make it a lot easier to prohibit splice&co from accessing
    the pipe.

    [?] Should this be done some other way? I'd rather not use up a new
    O_* flag if I can avoid it - should I add a pipe3() system call
    instead?

    The pipe is then configured:

    ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
    ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

    Messages are then read out of the pipe using read().

    - It should be possible to allow write() to insert data into the
    notification pipes too, but this is currently disabled as the
    kernel has to be able to insert messages into the pipe *without*
    holding pipe->mutex and the code to make this work needs careful
    auditing.

    - sendfile(), splice() and vmsplice() are disabled on notification
    pipes because of the pipe->mutex issue and also because they
    sometimes want to revert what they just did - but one or more
    notification messages might've been interleaved in the ring.

    - The kernel inserts messages with the wait queue spinlock held. This
    means that pipe_read() and pipe_write() have to take the spinlock
    to update the queue pointers.

    - Records in the buffer are binary, typed and have a length so that
    they can be of varying size.

    This allows multiple heterogeneous sources to share a common
    buffer; there are 16 million types available, of which I've used
    just a few, so there is scope for others to be used. Tags may be
    specified when a watchpoint is created to help distinguish the
    sources.

    - Records are filterable as types have up to 256 subtypes that can be
    individually filtered. Other filtration is also available.

    - Notification pipes don't interfere with each other; each may be
    bound to a different set of watches. Any particular notification
    will be copied to all the queues that are currently watching for it
    - and only those that are watching for it.

    - When recording a notification, the kernel will not sleep, but will
    rather mark a queue as having lost a message if there's
    insufficient space. read() will fabricate a loss notification
    message at an appropriate point later.

    - The notification pipe is created and then watchpoints are attached
    to it, using one of:

    keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
    watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
    watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

    where in all cases, fd indicates the queue and the number after is
    a tag between 0 and 255.

    - Watches are removed if either the notification pipe is destroyed or
    the watched object is destroyed. In the latter case, a message will
    be generated indicating the enforced watch removal.
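
    To make the design points above more concrete, here is a userspace
    sketch that opens and sizes a notification pipe and splits a record
    header into its fields. The fallback definitions, the bit positions
    of length and tag inside the info word, the little-endian style
    split of the first word, and all sketch_ and helper names are
    assumptions that should be checked against the UAPI header.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fallbacks for headers that predate the notification queue. */
#ifndef O_NOTIFICATION_PIPE
#define O_NOTIFICATION_PIPE O_EXCL
#endif
#ifndef IOC_WATCH_QUEUE_SET_SIZE
#define IOC_WATCH_QUEUE_SET_SIZE _IO('W', 0x60)
#endif

/* Create a notification pipe and set its queue depth.
 * Returns 0 on success, -1 if the kernel refuses either step. */
static int open_notification_pipe(int fds[2], int queue_depth)
{
	if (pipe2(fds, O_NOTIFICATION_PIPE) < 0)
		return -1;
	if (ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	return 0;
}

/* Decoded view of one record header (hypothetical helper type). */
struct sketch_record {
	uint32_t type;    /* 24 bits: about 16 million types */
	uint32_t subtype; /* 8 bits: up to 256 subtypes per type */
	uint32_t length;  /* record length in bytes (assumed low 7 bits) */
	uint32_t tag;     /* watch tag, 0..255 (assumed bits 8..15) */
};

/* Split the two 32-bit header words into their fields. */
static struct sketch_record decode_record_header(uint32_t type_word,
						 uint32_t info)
{
	struct sketch_record r;

	r.type = type_word & 0x00ffffffu;
	r.subtype = type_word >> 24;
	r.length = info & 0x7fu;
	r.tag = (info >> 8) & 0xffu;
	return r;
}
```

    A reader would call read() on fds[0], decode each header, and use
    the length field to step to the next record in the buffer.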

    Things I want to avoid:

    - Introducing features that make the core VFS dependent on the
    network stack or networking namespaces (ie. usage of netlink).

    - Dumping all this stuff into dmesg and having a daemon that sits
    there parsing the output and distributing it as this then puts the
    responsibility for security into userspace and makes handling
    namespaces tricky. Further, dmesg might not exist or might be
    inaccessible inside a container.

    - Letting users see events they shouldn't be able to see.

    TESTING AND MANPAGES
    ====================

    - The keyutils tree has a pipe-watch branch that has keyctl commands
    for making use of notifications. Proposed manual pages can also be
    found on this branch, though a couple of them really need to go to
    the main manpages repository instead.

    If the kernel supports the watching of keys, then running "make
    test" on that branch will cause the testing infrastructure to spawn
    a monitoring process on the side that monitors a notifications pipe
    for all the key/keyring changes induced by the tests and they'll
    all be checked off to make sure they happened.

    https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

    - A test program is provided (samples/watch_queue/watch_test) that
    can be used to monitor for keyrings, mount and superblock events.
    Information on the notifications is simply logged to stdout"

    * tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    smack: Implement the watch_key and post_notification hooks
    selinux: Implement the watch_key security hook
    keys: Make the KEY_NEED_* perms an enum rather than a mask
    pipe: Add notification lossage handling
    pipe: Allow buffers to be marked read-whole-or-error for notifications
    Add sample notification program
    watch_queue: Add a key/keyring notification facility
    security: Add hooks to rule on setting a watch
    pipe: Add general notification queue support
    pipe: Add O_NOTIFICATION_PIPE
    security: Add a hook for the point of notification insertion
    uapi: General notification queue definitions

    Linus Torvalds
     

05 Jun, 2020

1 commit

  • Pull execve updates from Eric Biederman:
    "Last cycle for the Nth time I ran into bugs and quality of
    implementation issues related to exec that could not easily be
    fixed because of the way exec is implemented. So I have been digging
    into exec and cleaning up what I can.

    I don't think I have exec sorted out enough to fix the issues I
    started with but I have made some headway this cycle with 4 sets of
    changes.

    - promised cleanups after introducing exec_update_mutex

    - trivial cleanups for exec

    - control flow simplifications

    - remove the recomputation of bprm->cred

    The net result is code that is a bit easier to understand and work
    with and a decrease in the number of lines of code (if you don't count
    the added tests)"

    * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
    exec: Compute file based creds only once
    exec: Add a per bprm->file version of per_clear
    binfmt_elf_fdpic: fix execfd build regression
    selftests/exec: Add binfmt_script regression test
    exec: Remove recursion from search_binary_handler
    exec: Generic execfd support
    exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
    exec: Move the call of prepare_binprm into search_binary_handler
    exec: Allow load_misc_binary to call prepare_binprm unconditionally
    exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
    exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
    exec: Teach prepare_exec_creds how exec treats uids & gids
    exec: Set the point of no return sooner
    exec: Move handling of the point of no return to the top level
    exec: Run sync_mm_rss before taking exec_update_mutex
    exec: Fix spelling of search_binary_handler in a comment
    exec: Move the comment from above de_thread to above unshare_sighand
    exec: Rename flush_old_exec begin_new_exec
    exec: Move most of setup_new_exec into flush_old_exec
    exec: In setup_new_exec cache current in the local variable me
    ...

    Linus Torvalds
     

21 May, 2020

1 commit

  • Today security_bprm_set_creds has several implementations:
    apparmor_bprm_set_creds, cap_bprm_set_creds, selinux_bprm_set_creds,
    smack_bprm_set_creds, and tomoyo_bprm_set_creds.

    Except for cap_bprm_set_creds they all test bprm->called_set_creds and
    return immediately if it is true. The function cap_bprm_set_creds
    ignores bprm->called_set_creds entirely.

    Create a new LSM hook security_bprm_creds_for_exec that is called just
    before prepare_binprm in __do_execve_file, resulting in an LSM hook
    that is called exactly once for the entirety of exec. Move the bits
    of security_bprm_set_creds that only want to be called once per exec
    into security_bprm_creds_for_exec, leaving only cap_bprm_set_creds
    behind.

    Remove bprm->called_set_creds, as all of its former users have been
    moved to security_bprm_creds_for_exec.

    Add or update comments as appropriate to bring them up to date and
    to reflect this change.

    Link: https://lkml.kernel.org/r/87v9kszrzh.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Acked-by: Casey Schaufler # For the LSM and Smack bits
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
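
    The split between a once-per-exec hook and a hook that runs on every
    prepare_binprm call can be modelled in user space. This is a minimal
    sketch, not kernel code: struct fake_bprm, its counters, and every
    function ending in _model are hypothetical names, and the assumption
    that prepare_binprm runs once per nested interpreter is an
    illustration of why a once-only hook is useful.

```c
/* Hypothetical stand-in for struct linux_binprm; not the kernel layout. */
struct fake_bprm {
    int creds_for_exec_calls; /* times the once-per-exec hook ran */
    int set_creds_calls;      /* times the per-binary hook ran */
};

/* Models security_bprm_creds_for_exec: work that must happen once per exec. */
static void creds_for_exec_model(struct fake_bprm *bprm)
{
    bprm->creds_for_exec_calls++;
}

/* Models what stays behind in security_bprm_set_creds. */
static void set_creds_model(struct fake_bprm *bprm)
{
    bprm->set_creds_calls++;
}

/* Models prepare_binprm, assumed here to run for each binary examined. */
static void prepare_binprm_model(struct fake_bprm *bprm)
{
    set_creds_model(bprm);
}

/*
 * Models one exec that passes through `interp_levels` nested interpreters
 * (for example a script whose "#!" line names another binary).
 */
static void exec_model(struct fake_bprm *bprm, int interp_levels)
{
    creds_for_exec_model(bprm);      /* exactly once, before prepare_binprm */
    for (int i = 0; i <= interp_levels; i++)
        prepare_binprm_model(bprm);  /* once per binary that is examined */
}
```

    With two interpreter levels the model runs the once-only hook a single
    time and the per-binary hook three times, so no "already called" flag
    is needed in the once-only hook.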
     

19 May, 2020

2 commits

  • Implement the watch_key security hook in Smack to make sure that a key
    grants the caller Read permission in order to set a watch on a key.

    Also implement the post_notification security hook to make sure that the
    notification source is granted Write permission by the watch queue.

    For the moment, the watch_devices security hook is left unimplemented as
    it's not obvious what the object should be since the queue is global and
    didn't previously exist.

    Signed-off-by: David Howells
    Acked-by: Casey Schaufler

    David Howells
     
  • Since the meaning of combining the KEY_NEED_* constants is undefined, make
    it so that you can't do that by turning them into an enum.

    The enum is also given some extra values to represent special
    circumstances, such as:

    (1) The '0' value is reserved and causes a warning to trap the parameter
    being unset.

    (2) The key is to be unlinked and we require no permissions on it, only
    the keyring, (this replaces the KEY_LOOKUP_FOR_UNLINK flag).

    (3) An override due to CAP_SYS_ADMIN.

    (4) An override due to an instantiation token being present.

    (5) The permissions check is being deferred to later key_permission()
    calls.

    The extra values give the opportunity for LSMs to audit these situations.

    [Note: This really needs overhauling so that lookup_user_key() tells
    key_task_permission() and the LSM what operation is being done and leaves
    it to those functions to decide how to map that onto the available
    permits. However, I don't really want to make these changes in the middle
    of the notifications patchset.]

    Signed-off-by: David Howells
    cc: Jarkko Sakkinen
    cc: Paul Moore
    cc: Stephen Smalley
    cc: Casey Schaufler
    cc: keyrings@vger.kernel.org
    cc: selinux@vger.kernel.org

    David Howells
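
    The move from combinable flags to an enum can be sketched as follows.
    This is an illustration under assumptions: the enumerator and helper
    names are hypothetical and chosen to follow the five cases listed in
    the commit message, and they are not copied from the kernel headers.

```c
/* Sketch: one value per request, so values cannot be OR-ed together. */
enum need_perm_sketch {
    NEED_UNSPECIFIED = 0,     /* (1) reserved; catches an unset parameter */
    NEED_VIEW,
    NEED_READ,
    NEED_WRITE,
    NEED_SEARCH,
    NEED_LINK,
    NEED_SETATTR,
    NEED_UNLINK,              /* (2) no permission needed on the key itself */
    NEED_SYSADMIN_OVERRIDE,   /* (3) override due to CAP_SYS_ADMIN */
    NEED_AUTHTOKEN_OVERRIDE,  /* (4) override due to an instantiation token */
    NEED_DEFER_PERM_CHECK     /* (5) check deferred to a later call */
};

/* Returns 1 if the caller actually set a value, 0 for the reserved 0. */
static int need_is_set(enum need_perm_sketch need)
{
    return need != NEED_UNSPECIFIED;
}

/* Returns 1 only for values that ask for a permission check on the key. */
static int need_checks_key(enum need_perm_sketch need)
{
    switch (need) {
    case NEED_VIEW:
    case NEED_READ:
    case NEED_WRITE:
    case NEED_SEARCH:
    case NEED_LINK:
    case NEED_SETATTR:
        return 1;
    default:
        /* Special circumstances: reported for auditing, not checked here. */
        return 0;
    }
}
```

    Because each special circumstance has its own value, a caller such as
    an LSM can tell them apart in a switch and audit them separately.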
     

12 May, 2020

1 commit


07 May, 2020

5 commits

  • The inode_smack cache is no longer used.
    Remove it.

    Signed-off-by: Vishal Goel
    Signed-off-by: Casey Schaufler

    Casey Schaufler
     
  • "smk_lock" mutex is used during inode instantiation in
    the smack_d_instantiate() function. It was there to prevent
    simultaneous access to the same inode security structure.
    Smack-related initialization is done only once, during inode
    creation. If the inode has already been instantiated, the
    smack_d_instantiate() function returns without doing anything.

    So the mutex is needed only during inode creation. But two
    processes can't create the same inode or file simultaneously.
    Linking or any other file operation also can't run while the
    file is being created, because file lookup fails until the
    dentry is linked to the inode, which happens after Smack
    initialization.
    So no mutex is required in the inode_smack structure.

    Removing it saves memory and slightly improves performance.
    If 40000 inodes are created in the system, it saves 1.5 MB on
    32-bit systems and 2.8 MB on 64-bit systems.

    Signed-off-by: Vishal Goel
    Signed-off-by: Amit Sahrawat
    Signed-off-by: Casey Schaufler

    Casey Schaufler
     
  • Add barrier to soob. Return -EOVERFLOW if the buffer
    is exceeded.

    Suggested-by: Hillf Danton
    Reported-by: syzbot+bfdd4a2f07be52351350@syzkaller.appspotmail.com
    Signed-off-by: Casey Schaufler

    Casey Schaufler
     
    Commit afb1cbe37440 ("LSM: Infrastructure management
    of the inode security") removed the usage of smk_rcu,
    so remove it from the structure.

    Signed-off-by: Maninder Singh
    Signed-off-by: Vaneet Narang
    Signed-off-by: Casey Schaufler

    Maninder Singh
     
  • The mix of IS_ENABLED() and #ifdef checks has left a combination
    that causes a warning about an unused variable:

    security/smack/smack_lsm.c: In function 'smack_socket_connect':
    security/smack/smack_lsm.c:2838:24: error: unused variable 'sip' [-Werror=unused-variable]
    2838 | struct sockaddr_in6 *sip = (struct sockaddr_in6 *)sap;

    Change the code to use C-style checks consistently so the compiler
    can handle it correctly.

    Fixes: 87fbfffcc89b ("broken ping to ipv6 linklocal addresses on debian buster")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Casey Schaufler

    Arnd Bergmann
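
    The difference between preprocessor checks and C-style checks can be
    shown in a small user-space sketch. FEATURE_IPV6 and handle_connect()
    are hypothetical names standing in for a configuration option and the
    affected function; the kernel's IS_ENABLED() macro is not used here.

```c
/* Hypothetical feature switch standing in for a configuration option. */
#define FEATURE_IPV6 0

/*
 * Preprocessor style (shown only as a comment): when the feature is off,
 * the only use of `port` is removed before the compiler sees it, so
 * -Wunused-variable fires.
 *
 *     int port = input;
 * #if FEATURE_IPV6
 *     return port + 1;
 * #endif
 *     return 0;
 */

/*
 * C style: the compiler always parses both branches, so `port` counts as
 * used. The dead branch can still be discarded by the optimizer.
 */
static int handle_connect(int input)
{
    int port = input;

    if (FEATURE_IPV6)
        return port + 1;
    return 0;
}
```

    Using one style consistently for the declaration and every use is what
    keeps the variable visible to the compiler in all configurations.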
     

09 Feb, 2020

1 commit

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
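
    The idea of syntax types living in the namespace of functions, rather
    than in a system-wide enum, can be sketched in user space. Everything
    below is hypothetical and simplified: param_spec, param_is_int(),
    param_is_timeout() and parse_param() are illustrative names and are
    not the kernel's fs_parser API.

```c
#include <stdlib.h>
#include <string.h>

/* A syntax type is just a function that parses a value string. */
typedef int (*param_type_fn)(const char *value, long *result);

struct param_spec {
    const char *name;     /* parameter name, NULL terminates the table */
    param_type_fn type;   /* how to parse its value */
    int opt;              /* caller-defined option id returned on success */
};

/* A helper that could be shared: a plain decimal integer. */
static int param_is_int(const char *value, long *result)
{
    char *end;

    if (!value || !*value)
        return -1;
    *result = strtol(value, &end, 10);
    return *end == '\0' ? 0 : -1;
}

/* Private to one hypothetical filesystem: "15s" or "2h", in seconds. */
static int param_is_timeout(const char *value, long *result)
{
    char *end;
    long n;

    if (!value || !*value)
        return -1;
    n = strtol(value, &end, 10);
    if (end == value || n < 0)
        return -1;
    if (end[0] == '\0' || end[1] != '\0')
        return -1;        /* require exactly one unit character */
    switch (end[0]) {
    case 's': *result = n; return 0;
    case 'm': *result = n * 60; return 0;
    case 'h': *result = n * 3600; return 0;
    default:  return -1;
    }
}

/* Generic lookup: no switch over type ids, so new types need no changes. */
static int parse_param(const struct param_spec *specs, const char *key,
                       const char *value, long *result)
{
    for (const struct param_spec *p = specs; p->name; p++) {
        if (strcmp(p->name, key) == 0)
            return p->type(value, result) == 0 ? p->opt : -1;
    }
    return -2;            /* unknown parameter */
}

static const struct param_spec example_specs[] = {
    { "size",    param_is_int,     1 },
    { "timeout", param_is_timeout, 2 },
    { NULL,      NULL,             0 },
};
```

    Adding the timeout type required only a new function and a table
    entry; parse_param() itself did not change.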
     

08 Feb, 2020

2 commits


06 Feb, 2020

1 commit

  • I am seeing ping failures to IPv6 linklocal addresses with Debian
    buster. Easiest example to reproduce is:

    $ ping -c1 -w1 ff02::1%eth1
    connect: Invalid argument

    $ ping -c1 -w1 ff02::1%eth1
    PING ff02::01%eth1(ff02::1%eth1) 56 data bytes
    64 bytes from fe80::e0:f9ff:fe0c:37%eth1: icmp_seq=1 ttl=64 time=0.059 ms

    git bisect traced the failure to
    commit b9ef5513c99b ("smack: Check address length before reading address family")

    Arguably ping is being stupid since the buster version is not setting
    the address family properly (ping on stretch for example does):

    $ strace -e connect ping6 -c1 -w1 ff02::1%eth1
    connect(5, {sa_family=AF_UNSPEC,
    sa_data="\4\1\0\0\0\0\377\2\0\0\0\0\0\0\0\0\0\0\0\0\0\1\3\0\0\0"}, 28)
    = -1 EINVAL (Invalid argument)

    but the command works fine on kernels prior to this commit, so this is
    breakage which goes against the Linux paradigm of "don't break userspace".

    Cc: stable@vger.kernel.org
    Reported-by: David Ahern
    Suggested-by: Tetsuo Handa
    Signed-off-by: Casey Schaufler

     security/smack/smack_lsm.c | 41 +++++++++++++++++++----------------------
    1 file changed, 19 insertions(+), 22 deletions(-)

    Casey Schaufler
     

24 Oct, 2019

1 commit


24 Sep, 2019

1 commit

  • Pull smack updates from Casey Schaufler:
    "Four patches for v5.4. Nothing is major.

    All but one are in response to mechanically detected potential issues.
    The remaining patch cleans up kernel-doc notations"

    * tag 'smack-for-5.4-rc1' of git://github.com/cschaufler/smack-next:
    smack: use GFP_NOFS while holding inode_smack::smk_lock
    security: smack: Fix possible null-pointer dereferences in smack_socket_sock_rcv_skb()
    smack: fix some kernel-doc notations
    Smack: Don't ignore other bprm->unsafe flags if LSM_UNSAFE_PTRACE is set

    Linus Torvalds
     

05 Sep, 2019

3 commits

  • inode_smack::smk_lock is taken during smack_d_instantiate(), which is
    called during a filesystem transaction when creating a file on ext4.
    Therefore to avoid a deadlock, all code that takes this lock must use
    GFP_NOFS, to prevent memory reclaim from waiting for the filesystem
    transaction to complete.

    Reported-by: syzbot+0eefc1e06a77d327a056@syzkaller.appspotmail.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Eric Biggers
    Signed-off-by: Casey Schaufler

    Eric Biggers
     
  • In smack_socket_sock_rcv_skb(), there is an if statement
    on line 3920 to check whether skb is NULL:
    if (skb && skb->secmark != 0)

    This check indicates skb can be NULL in some cases.

    But on lines 3931 and 3932, skb is used:
    ad.a.u.net->netif = skb->skb_iif;
    ipv6_skb_to_auditdata(skb, &ad.a, NULL);

    Thus, possible null-pointer dereferences may occur when skb is NULL.

    To fix these possible bugs, an if statement is added to check skb.

    These bugs were found by STCheck, a static analysis tool written by us.

    Signed-off-by: Jia-Ju Bai
    Signed-off-by: Casey Schaufler

    Jia-Ju Bai
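
    The pattern behind this fix can be sketched in user space. This is one
    plausible shape of the guard, not the actual patch: struct fake_skb,
    struct fake_audit and rcv_skb_model() are hypothetical stand-ins for
    the kernel structures and function.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's sk_buff or audit data types. */
struct fake_skb {
    int secmark;
    int skb_iif;
};

struct fake_audit {
    int netif;
};

/* Returns 1 if the packet carries a non-zero secmark, 0 otherwise. */
static int rcv_skb_model(const struct fake_skb *skb, struct fake_audit *ad)
{
    int marked = 0;

    /* Existing check: it already admits that skb may be NULL. */
    if (skb && skb->secmark != 0)
        marked = 1;

    /* Added guard: only dereference skb for audit data when it exists. */
    if (skb)
        ad->netif = skb->skb_iif;

    return marked;
}
```

    Once one code path tests a pointer for NULL, every later dereference
    on the same path needs the same guard, which is the inconsistency a
    static analysis tool can flag.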
     
  • Fix/add kernel-doc notation and fix typos in security/smack/.

    Signed-off-by: Liguang Zhang
    Signed-off-by: Casey Schaufler

    luanshi