Eric Lee / smarc-fsl-linux-kernel

24 Feb, 2021

1 commit

7d6beb71d Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux ... Browse Code »

Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.

Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:

- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.

- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).

- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.

- It is possible to share files between containers with
non-overlapping idmappings.

- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.

- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.

- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.

- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.

Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:

- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.

https://systemd.io/HOME_DIRECTORY/

- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734

- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.

- ChromeOS: Sharing host-directories with unprivileged containers.

I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:

https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
https://fosdem.org/2021/schedule/event/containers_idmap/

This comes with an extensive xfstests suite covering both ext4 and
xfs:

https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.

In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.

Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.

The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.

The following conditions must be met in order to create an idmapped
mount:

- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.

- The underlying filesystem must support idmapped mounts.

- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.

- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.

The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.

By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.

The manpage with a detailed description can be found here:

https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8

In order to support idmapped mounts, filesystems need to be changed
and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
patches to convert individual filesystem are not very large or
complicated overall as can be seen from the included fat, ext4, and
xfs ports. Patches for other filesystems are actively worked on and
will be sent out separately. The xfstestsuite can be used to verify
that port has been done correctly.

The mount_setattr() syscall is motivated independent of the idmapped
mounts patches and it's been around since July 2019. One of the most
valuable features of the new mount api is the ability to perform
mounts based on file descriptors only.

Together with the lookup restrictions available in the openat2()
RESOLVE_* flag namespace which we added in v5.6 this is the first time
we are close to hardened and race-free (e.g. symlinks) mounting and
path resolution.

While userspace has started porting to the new mount api to mount
proper filesystems and create new bind-mounts it is currently not
possible to change mount options of an already existing bind mount in
the new mount api since the mount_setattr() syscall is missing.

With the addition of the mount_setattr() syscall we remove this last
restriction and userspace can now fully port to the new mount api,
covering every use-case the old mount api could. We also add the
crucial ability to recursively change mount options for a whole mount
tree, both removing and adding mount options at the same time. This
syscall has been requested multiple times by various people and
projects.

There is a simple tool available at

https://github.com/brauner/mount-idmapped

that allows to create idmapped mounts so people can play with this
patch series. I'll add support for the regular mount binary should you
decide to pull this in the following weeks:

Here's an example to a simple idmapped mount of another user's home
directory:

u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo

u1001@f2-vm:/$ touch /mnt/my-file

u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--

u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
xfs: support idmapped mounts
ext4: support idmapped mounts
fat: handle idmapped mounts
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add mount_setattr()
fs: add attr_flags_to_mnt_flags helper
fs: split out functions to hold writers
namespace: only take read lock in do_reconfigure_mnt()
mount: make {lock,unlock}_mount_hash() static
namespace: take lock_mount_hash() directly when changing flags
nfs: do not export idmapped mounts
overlayfs: do not mount on top of idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
ima: handle idmapped mounts
apparmor: handle idmapped mounts
fs: make helpers idmap mount aware
exec: handle idmapped mounts
would_dump: handle idmapped mounts
...

Linus Torvalds
2021-02-24 05:39:45 +0800

03 Feb, 2021

1 commit

7ef4c19d2 smackfs: restrict bytes count in smackfs write functions ... Browse Code »

syzbot found WARNINGs in several smackfs write operations where
bytes count is passed to memdup_user_nul which exceeds
GFP MAX_ORDER. Check count size if bigger than PAGE_SIZE.

Per smackfs doc, smk_write_net4addr accepts any label or -CIPSO,
smk_write_net6addr accepts any label or -DELETE. I couldn't find
any general rule for other label lengths except SMK_LABELLEN,
SMK_LONGLABEL, SMK_CIPSOMAX which are documented.

Let's constrain, in general, smackfs label lengths for PAGE_SIZE.
Although fuzzer crashes write to smackfs/netlabel on 0x400000 length.

Here is a quick way to reproduce the WARNING:
python -c "print('A' * 0x400000)" > /sys/fs/smackfs/netlabel

Reported-by: syzbot+a71a442385a0b2815497@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov
Signed-off-by: Casey Schaufler

Sabyrzhan Tasbolatov
2021-02-03 09:14:02 +0800

24 Jan, 2021

2 commits

71bc356f9 commoncap: handle idmapped mounts ... Browse Code »

When interacting with user namespace and non-user namespace aware
filesystem capabilities the vfs will perform various security checks to
determine whether or not the filesystem capabilities can be used by the
caller, whether they need to be removed and so on. The main
infrastructure for this resides in the capability codepaths but they are
called through the LSM security infrastructure even though they are not
technically an LSM or optional. This extends the existing security hooks
security_inode_removexattr(), security_inode_killpriv(),
security_inode_getsecurity() to pass down the mount's user namespace and
makes them aware of idmapped mounts.

In order to actually get filesystem capabilities from disk the
capability infrastructure exposes the get_vfs_caps_from_disk() helper.
For user namespace aware filesystem capabilities a root uid is stored
alongside the capabilities.

In order to determine whether the caller can make use of the filesystem
capability or whether it needs to be ignored it is translated according
to the superblock's user namespace. If it can be translated to uid 0
according to that id mapping the caller can use the filesystem
capabilities stored on disk. If we are accessing the inode that holds
the filesystem capabilities through an idmapped mount we map the root
uid according to the mount's user namespace. Afterwards the checks are
identical to non-idmapped mounts: reading filesystem caps from disk
enforces that the root uid associated with the filesystem capability
must have a mapping in the superblock's user namespace and that the
caller is either in the same user namespace or is a descendant of the
superblock's user namespace. For filesystems that are mountable inside
user namespace the caller can just mount the filesystem and won't
usually need to idmap it. If they do want to idmap it they can create an
idmapped mount and mark it with a user namespace they created and which
is thus a descendant of s_user_ns. For filesystems that are not
mountable inside user namespaces the descendant rule is trivially true
because the s_user_ns will be the initial user namespace.

If the initial user namespace is passed nothing changes so non-idmapped
mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com
Cc: Christoph Hellwig
Cc: David Howells
Cc: Al Viro
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig
Acked-by: James Morris
Signed-off-by: Christian Brauner

Christian Brauner
2021-01-24 21:27:17 +0800
c7c7a1a18 xattr: handle idmapped mounts ... Browse Code »

When interacting with extended attributes the vfs verifies that the
caller is privileged over the inode with which the extended attribute is
associated. For posix access and posix default extended attributes a uid
or gid can be stored on-disk. Let the functions handle posix extended
attributes on idmapped mounts. If the inode is accessed through an
idmapped mount we need to map it according to the mount's user
namespace. Afterwards the checks are identical to non-idmapped mounts.
This has no effect for e.g. security xattrs since they don't store uids
or gids and don't perform permission checks on them like posix acls do.

Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
Cc: Christoph Hellwig
Cc: David Howells
Cc: Al Viro
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig
Reviewed-by: James Morris
Signed-off-by: Tycho Andersen
Signed-off-by: Christian Brauner

Tycho Andersen
2021-01-24 21:27:17 +0800

25 Dec, 2020

1 commit

2f2fce3d5 Merge tag 'Smack-for-5.11-io_uring-fix' of git://github.com/cschaufler/smack-next ... Browse Code »

Pull smack fix from Casey Schaufler:
"Provide a fix for the incorrect handling of privilege in the face of
io_uring's use of kernel threads. That invalidated an long standing
assumption regarding the privilege of kernel threads.

The fix is simple and safe. It was provided by Jens Axboe and has been
tested"

* tag 'Smack-for-5.11-io_uring-fix' of git://github.com/cschaufler/smack-next:
Smack: Handle io_uring kernel thread privileges

Linus Torvalds
2020-12-25 06:08:43 +0800

23 Dec, 2020

1 commit

942cb357a Smack: Handle io_uring kernel thread privileges ... Browse Code »

Smack assumes that kernel threads are privileged for smackfs
operations. This was necessary because the credential of the
kernel thread was not related to a user operation. With io_uring
the credential does reflect a user's rights and can be used.

Suggested-by: Jens Axboe
Acked-by: Jens Axboe
Acked-by: Eric W. Biederman
Signed-off-by: Casey Schaufler

Casey Schaufler
2020-12-23 07:34:24 +0800

17 Dec, 2020

1 commit

8bda68d68 Merge tag 'Smack-for-5.11' of git://github.com/cschaufler/smack-next ... Browse Code »

Pull smack updates from Casey Schaufler:
"There are no functional changes. Just one minor code clean-up and a
set of corrections in function header comments"

* tag 'Smack-for-5.11' of git://github.com/cschaufler/smack-next:
security/smack: remove unused varible 'rc'
Smack: fix kernel-doc interface on functions

Linus Torvalds
2020-12-17 03:11:58 +0800

04 Dec, 2020

1 commit

41dd9596d security: add const qualifier to struct sock in various places ... Browse Code »

A followup change to tcp_request_sock_op would have to drop the 'const'
qualifier from the 'route_req' function as the
'security_inet_conn_request' call is moved there - and that function
expects a 'struct sock *'.

However, it turns out its also possible to add a const qualifier to
security_inet_conn_request instead.

Signed-off-by: Florian Westphal
Acked-by: James Morris
Signed-off-by: Jakub Kicinski

Florian Westphal
2020-12-04 04:56:03 +0800

17 Nov, 2020

1 commit

9b0072e2b security/smack: remove unused varible 'rc' ... Browse Code »

This varible isn't used and can be removed to avoid a gcc warning:
security/smack/smack_lsm.c:3873:6: warning: variable ‘rc’ set but not
used [-Wunused-but-set-variable]

Signed-off-by: Alex Shi
Cc: Casey Schaufler
Cc: James Morris
Cc: "Serge E. Hallyn"
Cc: linux-security-module@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Casey Schaufler

Alex Shi
2020-11-17 09:26:31 +0800

14 Nov, 2020

1 commit

7da31b858 Smack: fix kernel-doc interface on functions ... Browse Code »

The are some kernel-doc interface issues:
security/smack/smackfs.c:1950: warning: Function parameter or member
'list' not described in 'smk_parse_label_list'
security/smack/smackfs.c:1950: warning: Excess function parameter
'private' description in 'smk_parse_label_list'
security/smack/smackfs.c:1979: warning: Function parameter or member
'list' not described in 'smk_destroy_label_list'
security/smack/smackfs.c:1979: warning: Excess function parameter 'head'
description in 'smk_destroy_label_list'
security/smack/smackfs.c:2141: warning: Function parameter or member
'count' not described in 'smk_read_logging'
security/smack/smackfs.c:2141: warning: Excess function parameter 'cn'
description in 'smk_read_logging'
security/smack/smackfs.c:2278: warning: Function parameter or member
'format' not described in 'smk_user_access'

Correct them in this patch.

Signed-off-by: Alex Shi
Cc: Casey Schaufler
Cc: James Morris
Cc: "Serge E. Hallyn"
Cc: linux-security-module@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Casey Schaufler

Alex Shi
2020-11-14 03:50:44 +0800

14 Oct, 2020

1 commit

99a6740f8 Merge tag 'Smack-for-5.10' of git://github.com/cschaufler/smack-next ... Browse Code »

Pull smack updates from Casey Schaufler:
"Two minor fixes and one performance enhancement to Smack. The
performance improvement is significant and the new code is more like
its counterpart in SELinux.

- Two kernel test robot suggested clean-ups.

- Teach Smack to use the IPv4 netlabel cache. This results in a
12-14% improvement on TCP benchmarks"

* tag 'Smack-for-5.10' of git://github.com/cschaufler/smack-next:
Smack: Remove unnecessary variable initialization
Smack: Fix build when NETWORK_SECMARK is not set
Smack: Use the netlabel cache
Smack: Set socket labels only once
Smack: Consolidate uses of secmark into a function

Linus Torvalds
2020-10-14 07:18:51 +0800

06 Oct, 2020

1 commit

edd615371 Smack: Remove unnecessary variable initialization ... Browse Code »

The initialization of rc in smack_from_netlbl() is pointless.

Reported-by: kernel test robot
Signed-off-by: Casey Schaufler

Casey Schaufler
2020-10-06 05:20:51 +0800

23 Sep, 2020

1 commit

bf0afe673 Smack: Fix build when NETWORK_SECMARK is not set ... Browse Code »

Use proper conditional compilation for the secmark field in
the network skb.

Reported-by: kernel test robot
Signed-off-by: Casey Schaufler

Casey Schaufler
2020-09-23 05:59:31 +0800

12 Sep, 2020

3 commits

322dd63c7 Smack: Use the netlabel cache ... Browse Code »

Utilize the Netlabel cache mechanism for incoming packet matching.
Refactor the initialization of secattr structures, as it was being
done in two places.

Signed-off-by: Casey Schaufler

Casey Schaufler
2020-09-12 06:31:31 +0800
a2af03188 Smack: Set socket labels only once ... Browse Code »

Refactor the IP send checks so that the netlabel value
is set only when necessary, not on every send. Some functions
get renamed as the changes made the old name misleading.

Signed-off-by: Casey Schaufler

Casey Schaufler
2020-09-12 06:31:30 +0800
36be81293 Smack: Consolidate uses of secmark into a function ... Browse Code »

Add a function smack_from_skb() that returns the Smack label
identified by a network secmark. Replace the explicit uses of
the secmark with this function.

Signed-off-by: Casey Schaufler

Casey Schaufler
2020-09-12 06:31:30 +0800

24 Aug, 2020

1 commit

df561f668 treewide: Use fallthrough pseudo-keyword ... Browse Code »

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva

Gustavo A. R. Silva
2020-08-24 06:36:59 +0800

28 Jul, 2020

2 commits

42a2df3e8 Smack: prevent underflow in smk_set_cipso() ... Browse Code »

We have an upper bound on "maplevel" but forgot to check for negative
values.

Fixes: e114e473771c ("Smack: Simplified Mandatory Access Control Kernel")
Signed-off-by: Dan Carpenter
Signed-off-by: Casey Schaufler

Dan Carpenter
2020-07-28 04:35:12 +0800
a6bd4f6d9 Smack: fix another vsscanf out of bounds ... Browse Code »

This is similar to commit 84e99e58e8d1 ("Smack: slab-out-of-bounds in
vsscanf") where we added a bounds check on "rule".

Reported-by: syzbot+a22c6092d003d6fe1122@syzkaller.appspotmail.com
Fixes: f7112e6c9abf ("Smack: allow for significantly longer Smack labels v4")
Signed-off-by: Dan Carpenter
Signed-off-by: Casey Schaufler

Dan Carpenter
2020-07-28 04:35:03 +0800

15 Jul, 2020

1 commit

beb4ee677 Smack: fix use-after-free in smk_write_relabel_self() ... Browse Code »

smk_write_relabel_self() frees memory from the task's credentials with
no locking, which can easily cause a use-after-free because multiple
tasks can share the same credentials structure.

Fix this by using prepare_creds() and commit_creds() to correctly modify
the task's credentials.

Reproducer for "BUG: KASAN: use-after-free in smk_write_relabel_self":

#include
#include
#include

static void *thrproc(void *arg)
{
int fd = open("/sys/fs/smackfs/relabel-self", O_WRONLY);
for (;;) write(fd, "foo", 3);
}

int main()
{
pthread_t t;
pthread_create(&t, NULL, thrproc, NULL);
thrproc(NULL);
}

Reported-by: syzbot+e6416dabb497a650da40@syzkaller.appspotmail.com
Fixes: 38416e53936e ("Smack: limited capability for changing process label")
Cc: # v4.4+
Signed-off-by: Eric Biggers
Signed-off-by: Casey Schaufler

Eric Biggers
2020-07-15 02:19:58 +0800

14 Jun, 2020

1 commit

6c3297841 Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/… ... Browse Code »

…git/dhowells/linux-fs

Pull notification queue from David Howells:
"This adds a general notification queue concept and adds an event
source for keys/keyrings, such as linking and unlinking keys and
changing their attributes.

Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:

https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

Without this, g-o-a has to constantly poll a keyring-based kerberos
cache to find out if kinit has changed anything.

[ There are other notification pending: mount/sb fsinfo notifications
for libmount that Karel Zak and Ian Kent have been working on, and
Christian Brauner would like to use them in lxc, but let's see how
this one works first ]

LSM hooks are included:

- A set of hooks are provided that allow an LSM to rule on whether or
not a watch may be set. Each of these hooks takes a different
"watched object" parameter, so they're not really shareable. The
LSM should use current's credentials. [Wanted by SELinux & Smack]

- A hook is provided to allow an LSM to rule on whether or not a
particular message may be posted to a particular queue. This is
given the credentials from the event generator (which may be the
system) and the watch setter. [Wanted by Smack]

I've provided SELinux and Smack with implementations of some of these
hooks.

WHY
===

Key/keyring notifications are desirable because if you have your
kerberos tickets in a file/directory, your Gnome desktop will monitor
that using something like fanotify and tell you if your credentials
cache changes.

However, we also have the ability to cache your kerberos tickets in
the session, user or persistent keyring so that it isn't left around
on disk across a reboot or logout. Keyrings, however, cannot currently
be monitored asynchronously, so the desktop has to poll for it - not
so good on a laptop. This facility will allow the desktop to avoid the
need to poll.

DESIGN DECISIONS
================

- The notification queue is built on top of a standard pipe. Messages
are effectively spliced in. The pipe is opened with a special flag:

pipe2(fds, O_NOTIFICATION_PIPE);

The special flag has the same value as O_EXCL (which doesn't seem
like it will ever be applicable in this context)[?]. It is given up
front to make it a lot easier to prohibit splice&co from accessing
the pipe.

[?] Should this be done some other way? I'd rather not use up a new
O_* flag if I can avoid it - should I add a pipe3() system call
instead?

The pipe is then configured::

ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

Messages are then read out of the pipe using read().

- It should be possible to allow write() to insert data into the
notification pipes too, but this is currently disabled as the
kernel has to be able to insert messages into the pipe *without*
holding pipe->mutex and the code to make this work needs careful
auditing.

- sendfile(), splice() and vmsplice() are disabled on notification
pipes because of the pipe->mutex issue and also because they
sometimes want to revert what they just did - but one or more
notification messages might've been interleaved in the ring.

- The kernel inserts messages with the wait queue spinlock held. This
means that pipe_read() and pipe_write() have to take the spinlock
to update the queue pointers.

- Records in the buffer are binary, typed and have a length so that
they can be of varying size.

This allows multiple heterogeneous sources to share a common
buffer; there are 16 million types available, of which I've used
just a few, so there is scope for others to be used. Tags may be
specified when a watchpoint is created to help distinguish the
sources.

- Records are filterable as types have up to 256 subtypes that can be
individually filtered. Other filtration is also available.

- Notification pipes don't interfere with each other; each may be
bound to a different set of watches. Any particular notification
will be copied to all the queues that are currently watching for it
- and only those that are watching for it.

- When recording a notification, the kernel will not sleep, but will
rather mark a queue as having lost a message if there's
insufficient space. read() will fabricate a loss notification
message at an appropriate point later.

- The notification pipe is created and then watchpoints are attached
to it, using one of:

keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

where in both cases, fd indicates the queue and the number after is
a tag between 0 and 255.

- Watches are removed if either the notification pipe is destroyed or
the watched object is destroyed. In the latter case, a message will
be generated indicating the enforced watch removal.

Things I want to avoid:

- Introducing features that make the core VFS dependent on the
network stack or networking namespaces (ie. usage of netlink).

- Dumping all this stuff into dmesg and having a daemon that sits
there parsing the output and distributing it as this then puts the
responsibility for security into userspace and makes handling
namespaces tricky. Further, dmesg might not exist or might be
inaccessible inside a container.

- Letting users see events they shouldn't be able to see.

TESTING AND MANPAGES
====================

- The keyutils tree has a pipe-watch branch that has keyctl commands
for making use of notifications. Proposed manual pages can also be
found on this branch, though a couple of them really need to go to
the main manpages repository instead.

If the kernel supports the watching of keys, then running "make
test" on that branch will cause the testing infrastructure to spawn
a monitoring process on the side that monitors a notifications pipe
for all the key/keyring changes induced by the tests and they'll
all be checked off to make sure they happened.

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

- A test program is provided (samples/watch_queue/watch_test) that
can be used to monitor for keyrings, mount and superblock events.
Information on the notifications is simply logged to stdout"

* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
smack: Implement the watch_key and post_notification hooks
selinux: Implement the watch_key security hook
keys: Make the KEY_NEED_* perms an enum rather than a mask
pipe: Add notification lossage handling
pipe: Allow buffers to be marked read-whole-or-error for notifications
Add sample notification program
watch_queue: Add a key/keyring notification facility
security: Add hooks to rule on setting a watch
pipe: Add general notification queue support
pipe: Add O_NOTIFICATION_PIPE
security: Add a hook for the point of notification insertion
uapi: General notification queue definitions

Linus Torvalds
2020-06-14 00:56:21 +0800

05 Jun, 2020

1 commit

15a2bc4db Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull execve updates from Eric Biederman:
"Last cycle for the Nth time I ran into bugs and quality of
implementation issues related to exec that could not be easily be
fixed because of the way exec is implemented. So I have been digging
into exec and cleanup up what I can.

I don't think I have exec sorted out enough to fix the issues I
started with but I have made some headway this cycle with 4 sets of
changes.

- promised cleanups after introducing exec_update_mutex

- trivial cleanups for exec

- control flow simplifications

- remove the recomputation of bprm->cred

The net result is code that is a bit easier to understand and work
with and a decrease in the number of lines of code (if you don't count
the added tests)"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
exec: Compute file based creds only once
exec: Add a per bprm->file version of per_clear
binfmt_elf_fdpic: fix execfd build regression
selftests/exec: Add binfmt_script regression test
exec: Remove recursion from search_binary_handler
exec: Generic execfd support
exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
exec: Move the call of prepare_binprm into search_binary_handler
exec: Allow load_misc_binary to call prepare_binprm unconditionally
exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
exec: Teach prepare_exec_creds how exec treats uids & gids
exec: Set the point of no return sooner
exec: Move handling of the point of no return to the top level
exec: Run sync_mm_rss before taking exec_update_mutex
exec: Fix spelling of search_binary_handler in a comment
exec: Move the comment from above de_thread to above unshare_sighand
exec: Rename flush_old_exec begin_new_exec
exec: Move most of setup_new_exec into flush_old_exec
exec: In setup_new_exec cache current in the local variable me
...

Linus Torvalds
2020-06-05 05:07:08 +0800

21 May, 2020

1 commit

b8bff5992 exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds ... Browse Code »

Today security_bprm_set_creds has several implementations:
apparmor_bprm_set_creds, cap_bprm_set_creds, selinux_bprm_set_creds,
smack_bprm_set_creds, and tomoyo_bprm_set_creds.

Except for cap_bprm_set_creds they all test bprm->called_set_creds and
return immediately if it is true. The function cap_bprm_set_creds
ignores bprm->calld_sed_creds entirely.

Create a new LSM hook security_bprm_creds_for_exec that is called just
before prepare_binprm in __do_execve_file, resulting in a LSM hook
that is called exactly once for the entire of exec. Modify the bits
of security_bprm_set_creds that only want to be called once per exec
into security_bprm_creds_for_exec, leaving only cap_bprm_set_creds
behind.

Remove bprm->called_set_creds all of it's former users have been moved
to security_bprm_creds_for_exec.

Add or upate comments a appropriate to bring them up to date and
to reflect this change.

Link: https://lkml.kernel.org/r/87v9kszrzh.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds
Acked-by: Casey Schaufler # For the LSM and Smack bits
Reviewed-by: Kees Cook
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2020-05-21 03:45:31 +0800

19 May, 2020

2 commits

a8478a602 smack: Implement the watch_key and post_notification hooks ... Browse Code »

Implement the watch_key security hook in Smack to make sure that a key
grants the caller Read permission in order to set a watch on a key.

Also implement the post_notification security hook to make sure that the
notification source is granted Write permission by the watch queue.

For the moment, the watch_devices security hook is left unimplemented as
it's not obvious what the object should be since the queue is global and
didn't previously exist.

Signed-off-by: David Howells
Acked-by: Casey Schaufler

David Howells
2020-05-19 22:47:38 +0800
8c0637e95 keys: Make the KEY_NEED_* perms an enum rather than a mask ... Browse Code »

Since the meaning of combining the KEY_NEED_* constants is undefined, make
it so that you can't do that by turning them into an enum.

The enum is also given some extra values to represent special
circumstances, such as:

(1) The '0' value is reserved and causes a warning to trap the parameter
being unset.

(2) The key is to be unlinked and we require no permissions on it, only
the keyring, (this replaces the KEY_LOOKUP_FOR_UNLINK flag).

(3) An override due to CAP_SYS_ADMIN.

(4) An override due to an instantiation token being present.

(5) The permissions check is being deferred to later key_permission()
calls.

The extra values give the opportunity for LSMs to audit these situations.

[Note: This really needs overhauling so that lookup_user_key() tells
key_task_permission() and the LSM what operation is being done and leaves
it to those functions to decide how to map that onto the available
permits. However, I don't really want to make these change in the middle
of the notifications patchset.]

Signed-off-by: David Howells
cc: Jarkko Sakkinen
cc: Paul Moore
cc: Stephen Smalley
cc: Casey Schaufler
cc: keyrings@vger.kernel.org
cc: selinux@vger.kernel.org

David Howells
2020-05-19 22:42:22 +0800

12 May, 2020

1 commit

ef26650a2 Smack: Remove unused inline function smk_ad_setfield_u_fs_path_mnt ... Browse Code »

commit a269434d2fb4 ("LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH")
left behind this, remove it.

Signed-off-by: YueHaibing
Signed-off-by: Casey Schaufler

YueHaibing
2020-05-12 01:25:37 +0800

07 May, 2020

5 commits

4ca752870 Smack:- Remove redundant inode_smack cache ... Browse Code »

The inode_smack cache is no longer used.
Remove it.

Signed-off-by: Vishal Goel
Signed-off-by: Casey Schaufler

Casey Schaufler
2020-05-07 05:46:26 +0800
921bb1cbb Smack:- Remove mutex lock "smk_lock" from inode_smack ... Browse Code »

"smk_lock" mutex is used during inode instantiation in
smack_d_instantiate()function. It has been used to avoid
simultaneous access on same inode security structure.
Since smack related initialization is done only once i.e during
inode creation. If the inode has already been instantiated then
smack_d_instantiate() function just returns without doing
anything.

So it means mutex lock is required only during inode creation.
But since 2 processes can't create same inodes or files
simultaneously. Also linking or some other file operation can't
be done simultaneously when the file is getting created since
file lookup will fail before dentry inode linkup which is done
after smack initialization.
So no mutex lock is required in inode_smack structure.

It will save memory as well as improve some performance.
If 40000 inodes are created in system, it will save 1.5 MB on
32-bit systems & 2.8 MB on 64-bit systems.

Signed-off-by: Vishal Goel
Signed-off-by: Amit Sahrawat
Signed-off-by: Casey Schaufler

Casey Schaufler
2020-05-07 05:46:26 +0800
84e99e58e Smack: slab-out-of-bounds in vsscanf ... Browse Code »

Add barrier to soob. Return -EOVERFLOW if the buffer
is exceeded.

Suggested-by: Hillf Danton
Reported-by: syzbot+bfdd4a2f07be52351350@syzkaller.appspotmail.com
Signed-off-by: Casey Schaufler

Casey Schaufler
2020-05-07 05:46:26 +0800
092c94aed smack: remove redundant structure variable from header. ... Browse Code »

commit afb1cbe37440 ("LSM: Infrastructure management
of the inode security") removed usage of smk_rcu,
thus removing it from structure.

Signed-off-by: Maninder Singh
Signed-off-by: Vaneet Narang
Signed-off-by: Casey Schaufler

Maninder Singh
2020-05-07 05:46:26 +0800
00720f0e7 smack: avoid unused 'sip' variable warning ... Browse Code »

The mix of IS_ENABLED() and #ifdef checks has left a combination
that causes a warning about an unused variable:

security/smack/smack_lsm.c: In function 'smack_socket_connect':
security/smack/smack_lsm.c:2838:24: error: unused variable 'sip' [-Werror=unused-variable]
2838 | struct sockaddr_in6 *sip = (struct sockaddr_in6 *)sap;

Change the code to use C-style checks consistently so the compiler
can handle it correctly.

Fixes: 87fbfffcc89b ("broken ping to ipv6 linklocal addresses on debian buster")
Signed-off-by: Arnd Bergmann
Signed-off-by: Casey Schaufler

Arnd Bergmann
2020-05-07 05:46:26 +0800

09 Feb, 2020

1 commit

c9d35ee04 Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs file system parameter updates from Al Viro:
"Saner fs_parser.c guts and data structures. The system-wide registry
of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
the horror switch() in fs_parse() that would have to grow another case
every time something got added to that system-wide registry.

New syntax types can be added by filesystems easily now, and their
namespace is that of functions - not of system-wide enum members. IOW,
they can be shared or kept private and if some turn out to be widely
useful, we can make them common library helpers, etc., without having
to do anything whatsoever to fs_parse() itself.

And we already get that kind of requests - the thing that finally
pushed me into doing that was "oh, and let's add one for timeouts -
things like 15s or 2h". If some filesystem really wants that, let them
do it. Without somebody having to play gatekeeper for the variants
blessed by direct support in fs_parse(), TYVM.

Quite a bit of boilerplate is gone. And IMO the data structures make a
lot more sense now. -200LoC, while we are at it"

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
tmpfs: switch to use of invalfc()
cgroup1: switch to use of errorfc() et.al.
procfs: switch to use of invalfc()
hugetlbfs: switch to use of invalfc()
cramfs: switch to use of errofc() et.al.
gfs2: switch to use of errorfc() et.al.
fuse: switch to use errorfc() et.al.
ceph: use errorfc() and friends instead of spelling the prefix out
prefix-handling analogues of errorf() and friends
turn fs_param_is_... into functions
fs_parse: handle optional arguments sanely
fs_parse: fold fs_parameter_desc/fs_parameter_spec
fs_parser: remove fs_parameter_description name field
add prefix to fs_context->log
ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
new primitive: __fs_parse()
switch rbd and libceph to p_log-based primitives
struct p_log, variants of warnf() et.al. taking that one instead
teach logfc() to handle prefices, give it saner calling conventions
get rid of cg_invalf()
...

Linus Torvalds
2020-02-09 05:26:41 +0800

08 Feb, 2020

2 commits

d7167b149 fs_parse: fold fs_parameter_desc/fs_parameter_spec ... Browse Code »

The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro

Al Viro
2020-02-08 03:48:37 +0800
96cafb9cc fs_parser: remove fs_parameter_description name field ... Browse Code »

Unused now.

Signed-off-by: Eric Sandeen
Acked-by: David Howells
Signed-off-by: Al Viro

Eric Sandeen
2020-02-08 03:48:36 +0800

06 Feb, 2020

1 commit

87fbfffcc broken ping to ipv6 linklocal addresses on debian buster ... Browse Code »

I am seeing ping failures to IPv6 linklocal addresses with Debian
buster. Easiest example to reproduce is:

$ ping -c1 -w1 ff02::1%eth1
connect: Invalid argument

$ ping -c1 -w1 ff02::1%eth1
PING ff02::01%eth1(ff02::1%eth1) 56 data bytes
64 bytes from fe80::e0:f9ff:fe0c:37%eth1: icmp_seq=1 ttl=64 time=0.059 ms

git bisect traced the failure to
commit b9ef5513c99b ("smack: Check address length before reading address family")

Arguably ping is being stupid since the buster version is not setting
the address family properly (ping on stretch for example does):

$ strace -e connect ping6 -c1 -w1 ff02::1%eth1
connect(5, {sa_family=AF_UNSPEC,
sa_data="\4\1\0\0\0\0\377\2\0\0\0\0\0\0\0\0\0\0\0\0\0\1\3\0\0\0"}, 28)
= -1 EINVAL (Invalid argument)

but the command works fine on kernels prior to this commit, so this is
breakage which goes against the Linux paradigm of "don't break userspace"

Cc: stable@vger.kernel.org
Reported-by: David Ahern
Suggested-by: Tetsuo Handa
Signed-off-by: Casey Schaufler

security/smack/smack_lsm.c | 41 +++++++++++++++++++----------------------
1 file changed, 19 insertions(+), 22 deletions(-)

Casey Schaufler
2020-02-06 06:16:27 +0800

24 Oct, 2019

1 commit

d055b4fb4 pipe: Reduce #inclusion of pipe_fs_i.h ... Browse Code »

Remove some #inclusions of linux/pipe_fs_i.h that don't seem to be
necessary any more.

Signed-off-by: David Howells

David Howells
2019-10-24 00:02:34 +0800

24 Sep, 2019

1 commit

e94f8ccde Merge tag 'smack-for-5.4-rc1' of git://github.com/cschaufler/smack-next ... Browse Code »

Pull smack updates from Casey Schaufler:
"Four patches for v5.4. Nothing is major.

All but one are in response to mechanically detected potential issues.
The remaining patch cleans up kernel-doc notations"

* tag 'smack-for-5.4-rc1' of git://github.com/cschaufler/smack-next:
smack: use GFP_NOFS while holding inode_smack::smk_lock
security: smack: Fix possible null-pointer dereferences in smack_socket_sock_rcv_skb()
smack: fix some kernel-doc notations
Smack: Don't ignore other bprm->unsafe flags if LSM_UNSAFE_PTRACE is set

Linus Torvalds
2019-09-24 05:25:45 +0800

05 Sep, 2019

3 commits

e5bfad3d7 smack: use GFP_NOFS while holding inode_smack::smk_lock ... Browse Code »

inode_smack::smk_lock is taken during smack_d_instantiate(), which is
called during a filesystem transaction when creating a file on ext4.
Therefore to avoid a deadlock, all code that takes this lock must use
GFP_NOFS, to prevent memory reclaim from waiting for the filesystem
transaction to complete.

Reported-by: syzbot+0eefc1e06a77d327a056@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers
Signed-off-by: Casey Schaufler

Eric Biggers
2019-09-05 00:37:07 +0800
3f4287e7d security: smack: Fix possible null-pointer dereferences in smack_socket_sock_rcv_skb() ... Browse Code »

In smack_socket_sock_rcv_skb(), there is an if statement
on line 3920 to check whether skb is NULL:
if (skb && skb->secmark != 0)

This check indicates skb can be NULL in some cases.

But on lines 3931 and 3932, skb is used:
ad.a.u.net->netif = skb->skb_iif;
ipv6_skb_to_auditdata(skb, &ad.a, NULL);

Thus, possible null-pointer dereferences may occur when skb is NULL.

To fix these possible bugs, an if statement is added to check skb.

These bugs are found by a static analysis tool STCheck written by us.

Signed-off-by: Jia-Ju Bai
Signed-off-by: Casey Schaufler

Jia-Ju Bai
2019-09-05 00:37:07 +0800
a1a07f223 smack: fix some kernel-doc notations ... Browse Code »

Fix/add kernel-doc notation and fix typos in security/smack/.

Signed-off-by: Liguang Zhang
Signed-off-by: Casey Schaufler

luanshi
2019-09-05 00:37:07 +0800