01 Feb, 2019

1 commit

  • binderfs should not have a separate device_initcall(). When a kernel is
    compiled with CONFIG_ANDROID_BINDERFS register the filesystem alongside
    CONFIG_ANDROID_IPC. This use-case is especially sensible when users specify
    CONFIG_ANDROID_IPC=y, CONFIG_ANDROID_BINDERFS=y and
    ANDROID_BINDER_DEVICES="".
    When CONFIG_ANDROID_BINDERFS=n then this always succeeds so there's no
    regression potential for legacy workloads.

    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

30 Jan, 2019

1 commit

  • We currently adhere to the reserved devices limit when creating new
    binderfs devices in binderfs instances not located in the inital ipc
    namespace. But it is still possible to rob the host instances of their 4
    reserved devices by creating the maximum allowed number of devices in a
    single binderfs instance located in a non-initial ipc namespace and then
    mounting 4 separate binderfs instances in non-initial ipc namespaces. That
    happens because the limit is currently not respected for the creation of
    the initial binder-control device node. Block this nonsense by performing
    the same check in binderfs_binder_ctl_create() that we perform in
    binderfs_binder_device_create().

    Fixes: 36bdf3cae09d ("binderfs: reserve devices for initial mount")
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

22 Jan, 2019

7 commits

  • In a previous commit we switched from a d_alloc_name() + d_lookup()
    combination to setup a new dentry and find potential duplicates to the more
    idiomatic lookup_one_len(). As far as I understand, this also means we need
    to switch from d_add() to d_instantiate() since lookup_one_len() will
    create a new dentry when it doesn't find an existing one and add the new
    dentry to the hash queues. So we only need to call d_instantiate() to
    connect the dentry to the inode and turn it into a positive dentry.

    If we were to use d_add() we sure see stack traces like the following
    indicating that adding the same dentry twice over the same inode:

    [ 744.441889] CPU: 4 PID: 2849 Comm: landscape-sysin Not tainted 5.0.0-rc1-brauner-binderfs #243
    [ 744.441889] Hardware name: Dell DCS XS24-SC2 /XS24-SC2 , BIOS S59_3C20 04/07/2011
    [ 744.441889] RIP: 0010:__d_lookup_rcu+0x76/0x190
    [ 744.441889] Code: 89 75 c0 49 c1 e9 20 49 89 fd 45 89 ce 41 83 e6 07 42 8d 04 f5 00 00 00 00 89 45 c8 eb 0c 48 8b 1b 48 85 db 0f 84 81 00 00 00 8b 63 fc 4c 3b 6b 10 75 ea 48 83 7b 08 00 74 e3 41 83 e4 fe 41
    [ 744.441889] RSP: 0018:ffffb8c984e27ad0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
    [ 744.441889] RAX: 0000000000000038 RBX: ffff9407ef770c08 RCX: ffffb8c980011000
    [ 744.441889] RDX: ffffb8c984e27b54 RSI: ffffb8c984e27ce0 RDI: ffff9407e6689600
    [ 744.441889] RBP: ffffb8c984e27b28 R08: ffffb8c984e27ba4 R09: 0000000000000007
    [ 744.441889] R10: ffff9407e5c4f05c R11: 973f3eb9d84a94e5 R12: 0000000000000002
    [ 744.441889] R13: ffff9407e6689600 R14: 0000000000000007 R15: 00000007bfef7a13
    [ 744.441889] FS: 00007f0db13bb740(0000) GS:ffff9407f3b00000(0000) knlGS:0000000000000000
    [ 744.441889] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 744.441889] CR2: 00007f0dacc51024 CR3: 000000032961a000 CR4: 00000000000006e0
    [ 744.441889] Call Trace:
    [ 744.441889] lookup_fast+0x53/0x300
    [ 744.441889] walk_component+0x49/0x350
    [ 744.441889] ? inode_permission+0x63/0x1a0
    [ 744.441889] link_path_walk.part.33+0x1bc/0x5a0
    [ 744.441889] ? path_init+0x190/0x310
    [ 744.441889] path_lookupat+0x95/0x210
    [ 744.441889] filename_lookup+0xb6/0x190
    [ 744.441889] ? __check_object_size+0xb8/0x1b0
    [ 744.441889] ? strncpy_from_user+0x50/0x1a0
    [ 744.441889] user_path_at_empty+0x36/0x40
    [ 744.441889] ? user_path_at_empty+0x36/0x40
    [ 744.441889] vfs_statx+0x76/0xe0
    [ 744.441889] __do_sys_newstat+0x3d/0x70
    [ 744.441889] __x64_sys_newstat+0x16/0x20
    [ 744.441889] do_syscall_64+0x5a/0x120
    [ 744.441889] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 744.441889] RIP: 0033:0x7f0db0ec2775
    [ 744.441889] Code: 00 00 00 75 05 48 83 c4 18 c3 e8 26 55 02 00 66 0f 1f 44 00 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 04 00 00 00 0f 05 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 e1 b6 2d 00 f7 d8 64 89
    [ 744.441889] RSP: 002b:00007ffc36bc9388 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
    [ 744.441889] RAX: ffffffffffffffda RBX: 00007ffc36bc9300 RCX: 00007f0db0ec2775
    [ 744.441889] RDX: 00007ffc36bc9400 RSI: 00007ffc36bc9400 RDI: 00007f0dad26f050
    [ 744.441889] RBP: 0000000000c0bc60 R08: 0000000000000000 R09: 0000000000000001
    [ 744.441889] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc36bc9400
    [ 744.441889] R13: 0000000000000001 R14: 00000000ffffff9c R15: 0000000000c0bc60

    Cc: Al Viro
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • The binderfs_binder_ctl_create() call is a no-op on subsequent calls and
    the first call is done before we unlock the suberblock. Hence, there is no
    need to take inode_lock() in there. Let's remove it.

    Suggested-by: Al Viro
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • Al pointed out that first calling kill_litter_super() before cleaning up
    info is more correct since destroying info doesn't depend on the state of
    the dentries and inodes. That the opposite remains true is not guaranteed.

    Suggested-by: Al Viro
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • - switch from d_alloc_name() + d_lookup() to lookup_one_len():
    Instead of using d_alloc_name() and then doing a d_lookup() with the
    allocated dentry to find whether a device with the name we're trying to
    create already exists switch to using lookup_one_len(). The latter will
    either return the existing dentry or a new one.

    - switch from kmalloc() + strscpy() to kmemdup():
    Use a more idiomatic way to copy the name for the new dentry that
    userspace gave us.

    Suggested-by: Al Viro
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • Al pointed out that on binderfs_fill_super() error
    deactivate_locked_super() will call binderfs_kill_super() so all of the
    freeing and putting we currently do in binderfs_fill_super() is unnecessary
    and buggy. Let's simply return errors and let binderfs_fill_super() take
    care of cleaning up on error.

    Suggested-by: Al Viro
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • - make binderfs control dentry immutable:
    We don't allow to unlink it since it is crucial for binderfs to be
    useable but if we allow to rename it we make the unlink trivial to
    bypass. So prevent renaming too and simply treat the control dentry as
    immutable.

    - add is_binderfs_control_device() helper:
    Take the opportunity and turn the check for the control dentry into a
    separate helper is_binderfs_control_device() since it's now used in two
    places.

    - simplify binderfs_rename():
    Instead of hand-rolling our custom version of simple_rename() just dumb
    the whole function down to first check whether we're trying to rename the
    control dentry. If we do EPERM the caller and if not call simple_rename().

    Suggested-by: Al Viro
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • The comment stems from an early version of that patchset and is just
    confusing now.

    Cc: Al Viro
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

18 Jan, 2019

1 commit

  • Fix to return a negative error code -ENOMEM from the new_inode() and
    d_make_root() error handling cases instead of 0, as done elsewhere in
    this function.

    Fixes: 849d540ddfcd ("binderfs: implement "max" mount option")
    Signed-off-by: Wei Yongjun
    Reviewed-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Wei Yongjun
     

12 Jan, 2019

1 commit

  • kbuild reported a build faile in [1]. This is triggered when CONFIG_IPC_NS
    is not set. So let's make the use of init_ipc_ns conditional on
    CONFIG_IPC_NS being set.

    [1]: https://lists.01.org/pipermail/kbuild-all/2019-January/056903.html

    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

11 Jan, 2019

3 commits

  • The binderfs instance in the initial ipc namespace will always have a
    reserve of 4 binder devices unless explicitly capped by specifying a lower
    value via the "max" mount option.
    This ensures when binder devices are removed (on accident or on purpose)
    they can always be recreated without risking that all minor numbers have
    already been used up.

    Cc: Todd Kjos
    Cc: Greg Kroah-Hartman
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • It doesn't make sense to call the header binder_ctl.h when its sole
    existence is tied to binderfs. So give it a sensible name. Users will far
    more easily remember binderfs.h than binder_ctl.h.

    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • Since binderfs can be mounted by userns root in non-initial user namespaces
    some precautions are in order. First, a way to set a maximum on the number
    of binder devices that can be allocated per binderfs instance and second, a
    way to reserve a reasonable chunk of binderfs devices for the initial ipc
    namespace.
    A first approach as seen in [1] used sysctls similiar to devpts but was
    shown to be flawed (cf. [2] and [3]) since some aspects were unneeded. This
    is an alternative approach which avoids sysctls completely and instead
    switches to a single mount option.

    Starting with this commit binderfs instances can be mounted with a limit on
    the number of binder devices that can be allocated. The max= mount
    option serves as a per-instance limit. If max= is set then only
    number of binder devices can be allocated in this binderfs
    instance.

    This allows to safely bind-mount binderfs instances into unprivileged user
    namespaces since userns root in a non-initial user namespace cannot change
    the mount option as long as it does not own the mount namespace the
    binderfs mount was created in and hence cannot drain the host of minor
    device numbers

    [1]: https://lore.kernel.org/lkml/20181221133909.18794-1-christian@brauner.io/
    [2]; https://lore.kernel.org/lkml/20181221163316.GA8517@kroah.com/
    [3]: https://lore.kernel.org/lkml/CAHRSSEx+gDVW4fKKK8oZNAir9G5icJLyodO8hykv3O0O1jt2FQ@mail.gmail.com/
    [4]: https://lore.kernel.org/lkml/20181221192044.5yvfnuri7gdop4rs@brauner.io/

    Cc: Todd Kjos
    Cc: Greg Kroah-Hartman
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

08 Jan, 2019

2 commits

  • When currently mounting binderfs in the same ipc namespace twice:

    mount -t binder binder /A
    mount -t binder binder /B

    then the binderfs instances mounted on /A and /B will be the same, i.e.
    they will have the same superblock. This was the first approach that seemed
    reasonable. However, this leads to some problems and inconsistencies:

    /* private binderfs instance in same ipc namespace */
    There is no way for a user to request a private binderfs instance in the
    same ipc namespace.
    This request has been made in a private mail to me by two independent
    people.

    /* bind-mounts */
    If users want the same binderfs instance to appear in multiple places they
    can use bind mounts. So there is no value in having a request for a new
    binderfs mount giving them the same instance.

    /* unexpected behavior */
    It's surprising that request to mount binderfs is not giving the user a new
    instance like tmpfs, devpts, ramfs, and others do.

    /* past mistakes */
    Other pseudo-filesystems once made the same mistakes of giving back the
    same superblock when actually requesting a new mount (cf. devpts's
    deprecated "newinstance" option).
    We should not make the same mistake. Once we've committed to always giving
    back the same superblock in the same IPC namespace with the next kernel
    release we will not be able to make that change so better to do it now.

    /* kdbusfs */
    It was pointed out to me that kdbusfs - which is conceptually closely
    related to binderfs - also allowed users to get a private kdbusfs instance
    in the same IPC namespace by making each mount of kdbusfs a separate
    instance. I think that makes a lot of sense.

    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     
  • The binderfs filesystem never needs to be mounted by the kernel itself.
    This is conceptually wrong and should never have been done in the first
    place.

    Fixes: 3ad20fe393b ("binder: implement binderfs")
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner
     

19 Dec, 2018

1 commit

  • As discussed at Linux Plumbers Conference 2018 in Vancouver [1] this is the
    implementation of binderfs.

    /* Abstract */
    binderfs is a backwards-compatible filesystem for Android's binder ipc
    mechanism. Each ipc namespace will mount a new binderfs instance. Mounting
    binderfs multiple times at different locations in the same ipc namespace
    will not cause a new super block to be allocated and hence it will be the
    same filesystem instance.
    Each new binderfs mount will have its own set of binder devices only
    visible in the ipc namespace it has been mounted in. All devices in a new
    binderfs mount will follow the scheme binder%d and numbering will always
    start at 0.

    /* Backwards compatibility */
    Devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES for the
    initial ipc namespace will work as before. They will be registered via
    misc_register() and appear in the devtmpfs mount. Specifically, the
    standard devices binder, hwbinder, and vndbinder will all appear in their
    standard locations in /dev. Mounting or unmounting the binderfs mount in
    the initial ipc namespace will have no effect on these devices, i.e. they
    will neither show up in the binderfs mount nor will they disappear when the
    binderfs mount is gone.

    /* binder-control */
    Each new binderfs instance comes with a binder-control device. No other
    devices will be present at first. The binder-control device can be used to
    dynamically allocate binder devices. All requests operate on the binderfs
    mount the binder-control device resides in.
    Assuming a new instance of binderfs has been mounted at /dev/binderfs
    via mount -t binderfs binderfs /dev/binderfs. Then a request to create a
    new binder device can be made as illustrated in [2].
    Binderfs devices can simply be removed via unlink().

    /* Implementation details */
    - dynamic major number allocation:
    When binderfs is registered as a new filesystem it will dynamically
    allocate a new major number. The allocated major number will be returned
    in struct binderfs_device when a new binder device is allocated.
    - global minor number tracking:
    Minor are tracked in a global idr struct that is capped at
    BINDERFS_MAX_MINOR. The minor number tracker is protected by a global
    mutex. This is the only point of contention between binderfs mounts.
    - struct binderfs_info:
    Each binderfs super block has its own struct binderfs_info that tracks
    specific details about a binderfs instance:
    - ipc namespace
    - dentry of the binder-control device
    - root uid and root gid of the user namespace the binderfs instance
    was mounted in
    - mountable by user namespace root:
    binderfs can be mounted by user namespace root in a non-initial user
    namespace. The devices will be owned by user namespace root.
    - binderfs binder devices without misc infrastructure:
    New binder devices associated with a binderfs mount do not use the
    full misc_register() infrastructure.
    The misc_register() infrastructure can only create new devices in the
    host's devtmpfs mount. binderfs does however only make devices appear
    under its own mountpoint and thus allocates new character device nodes
    from the inode of the root dentry of the super block. This will have
    the side-effect that binderfs specific device nodes do not appear in
    sysfs. This behavior is similar to devpts allocated pts devices and
    has no effect on the functionality of the ipc mechanism itself.

    [1]: https://goo.gl/JL2tfX
    [2]: program to allocate a new binderfs binder device:

    #define _GNU_SOURCE
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    int main(int argc, char *argv[])
    {
    int fd, ret, saved_errno;
    size_t len;
    struct binderfs_device device = { 0 };

    if (argc < 2)
    exit(EXIT_FAILURE);

    len = strlen(argv[1]);
    if (len > BINDERFS_MAX_NAME)
    exit(EXIT_FAILURE);

    memcpy(device.name, argv[1], len);

    fd = open("/dev/binderfs/binder-control", O_RDONLY | O_CLOEXEC);
    if (fd < 0) {
    printf("%s - Failed to open binder-control device\n",
    strerror(errno));
    exit(EXIT_FAILURE);
    }

    ret = ioctl(fd, BINDER_CTL_ADD, &device);
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    if (ret < 0) {
    printf("%s - Failed to allocate new binder device\n",
    strerror(errno));
    exit(EXIT_FAILURE);
    }

    printf("Allocated new binder device with major %d, minor %d, and "
    "name %s\n", device.major, device.minor,
    device.name);

    exit(EXIT_SUCCESS);
    }

    Cc: Martijn Coenen
    Cc: Greg Kroah-Hartman
    Signed-off-by: Christian Brauner
    Acked-by: Todd Kjos
    Signed-off-by: Greg Kroah-Hartman

    Christian Brauner