05 Jun, 2020

2 commits

  • Merge yet more updates from Andrew Morton:

    - More MM work. 100ish more to go. Mike Rapoport's "mm: remove
    __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue

    - Various other little subsystems

    * emailed patches from Andrew Morton : (127 commits)
    lib/ubsan.c: fix gcc-10 warnings
    tools/testing/selftests/vm: remove duplicate headers
    selftests: vm: pkeys: fix multilib builds for x86
    selftests: vm: pkeys: use the correct page size on powerpc
    selftests/vm/pkeys: override access right definitions on powerpc
    selftests/vm/pkeys: test correct behaviour of pkey-0
    selftests/vm/pkeys: introduce a sub-page allocator
    selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
    selftests/vm/pkeys: associate key on a mapped page and detect write violation
    selftests/vm/pkeys: associate key on a mapped page and detect access violation
    selftests/vm/pkeys: improve checks to determine pkey support
    selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
    selftests/vm/pkeys: fix number of reserved powerpc pkeys
    selftests/vm/pkeys: introduce powerpc support
    selftests/vm/pkeys: introduce generic pkey abstractions
    selftests: vm: pkeys: use the correct huge page size
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
    selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
    selftests/vm/pkeys: fix pkey_disable_clear()
    selftests: vm: pkeys: add helpers for pkey bits
    ...

    Linus Torvalds
     
  • copy_strings_kernel is always used with a single argument,
    adjust the calling convention to that.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Link: http://lkml.kernel.org/r/20200501104105.2621149-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

30 May, 2020

1 commit

  • Move the computation of creds from prepare_binfmt into begin_new_exec
    so that the creds need only be computed once. This is just code
    reorganization no semantic changes of any kind are made.

    Moving the computation is safe. I have looked through the kernel and
    verified none of the binfmts look at bprm->cred directly, and that
    there are no helpers that look at bprm->cred indirectly. Which means
    that it is not a problem to compute the bprm->cred later in the
    execution flow as it is not used until it becomes current->cred.

    A new function bprm_creds_from_file is added to contain the work that
    needs to be done. bprm_creds_from_file first computes which file
    bprm->executable or most likely bprm->file that the bprm->creds
    will be computed from.

    The funciton bprm_fill_uid is updated to receive the file instead of
    accessing bprm->file. The now unnecessary work needed to reset the
    bprm->cred->euid, and bprm->cred->egid is removed from brpm_fill_uid.
    A small comment to document that bprm_fill_uid now only deals with the
    work to handle suid and sgid files. The default case is already
    heandled by prepare_exec_creds.

    The function security_bprm_repopulate_creds is renamed
    security_bprm_creds_from_file and now is explicitly passed the file
    from which to compute the creds. The documentation of the
    bprm_creds_from_file security hook is updated to explain when the hook
    is called and what it needs to do. The file is passed from
    cap_bprm_creds_from_file into get_file_caps so that the caps are
    computed for the appropriate file. The now unnecessary work in
    cap_bprm_creds_from_file to reset the ambient capabilites has been
    removed. A small comment to document that the work of
    cap_bprm_creds_from_file is to read capabilities from the files
    secureity attribute and derive capabilities from the fact the
    user had uid 0 has been added.

    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 May, 2020

4 commits

  • Recursion in kernel code is generally a bad idea as it can overflow
    the kernel stack. Recursion in exec also hides that the code is
    looping and that the loop changes bprm->file.

    Instead of recursing in search_binary_handler have the methods that
    would recurse set bprm->interpreter and return 0. Modify exec_binprm
    to loop when bprm->interpreter is set. Consolidate all of the
    reassignments of bprm->file in that loop to make it clear what is
    going on.

    The structure of the new loop in exec_binprm is that all errors return
    immediately, while successful completion (ret == 0 &&
    !bprm->interpreter) just breaks out of the loop and runs what
    exec_bprm has always run upon successful completion.

    Fail if the an interpreter is being call after execfd has been set.
    The code has never properly handled an interpreter being called with
    execfd being set and with reassignments of bprm->file and the
    assignment of bprm->executable in generic code it has finally become
    possible to test and fail when if this problematic condition happens.

    With the reassignments of bprm->file and the assignment of
    bprm->executable moved into the generic code add a test to see if
    bprm->executable is being reassigned.

    In search_binary_handler remove the test for !bprm->file. With all
    reassignments of bprm->file moved to exec_binprm bprm->file can never
    be NULL in search_binary_handler.

    Link: https://lkml.kernel.org/r/87sgfwyd84.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Most of the support for passing the file descriptor of an executable
    to an interpreter already lives in the generic code and in binfmt_elf.
    Rework the fields in binfmt_elf that deal with executable file
    descriptor passing to make executable file descriptor passing a first
    class concept.

    Move the fd_install from binfmt_misc into begin_new_exec after the new
    creds have been installed. This means that accessing the file through
    /proc//fd/N is able to see the creds for the new executable
    before allowing access to the new executables files.

    Performing the install of the executables file descriptor after
    the point of no return also means that nothing special needs to
    be done on error. The exiting of the process will close all
    of it's open files.

    Move the would_dump from binfmt_misc into begin_new_exec right
    after would_dump is called on the bprm->file. This makes it
    obvious this case exists and that no nesting of bprm->file is
    currently supported.

    In binfmt_misc the movement of fd_install into generic code means
    that it's special error exit path is no longer needed.

    Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The code in prepare_binary_handler needs to be run every time
    search_binary_handler is called so move the call into search_binary_handler
    itself to make the code simpler and easier to understand.

    Link: https://lkml.kernel.org/r/87d070zrvx.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Reviewed-by: James Morris
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Add a flag preserve_creds that binfmt_misc can set to prevent
    credentials from being updated. This allows binfmt_misc to always
    call prepare_binprm. Allowing the credential computation logic to be
    consolidated.

    Not replacing the credentials with the interpreters credentials is
    safe because because an open file descriptor to the executable is
    passed to the interpreter. As the interpreter does not need to
    reopen the executable it is guaranteed to see the same file that
    exec sees.

    Ref: c407c033de84 ("[PATCH] binfmt_misc: improve calculation of interpreter's credentials")
    Link: https://lkml.kernel.org/r/87imgszrwo.fsf_-_@x220.int.ebiederm.org
    Acked-by: Linus Torvalds
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

05 Jul, 2019

1 commit

  • Convert the binfmt_misc filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    Signed-off-by: David Howells
    cc: Alexander Viro
    cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Al Viro

    David Howells
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have MODULE_LICENCE("GPL*") inside which was used in the initial
    scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

11 Jul, 2018

1 commit

  • it's exactly the same thing as
    dentry_open(&file->f_path, file->f_flags, file->f_cred)

    ... and rename it to file_clone_open(), while we are at it.
    'filp' naming convention is bogus; sure, it's "file pointer",
    but we generally don't do that kind of Hungarian notation.
    Some of the instances have too many callers to touch, but this
    one has only two, so let's sanitize it while we can...

    Acked-by: Linus Torvalds
    Signed-off-by: Al Viro

    Al Viro
     

16 Jun, 2018

1 commit

  • As we move stuff around, some doc references are broken. Fix some of
    them via this script:
    ./scripts/documentation-file-ref-check --fix

    Manually checked that produced results are valid.

    Acked-by: Matthias Brugger
    Acked-by: Takashi Iwai
    Acked-by: Jeff Kirsher
    Acked-by: Guenter Roeck
    Acked-by: Miguel Ojeda
    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

08 Jun, 2018

1 commit

  • WHen registering a new binfmt_misc handler, it is possible to overflow
    the offset to get a negative value, which might crash the system, or
    possibly leak kernel data.

    Here is a crash log when 2500000000 was used as an offset:

    BUG: unable to handle kernel paging request at ffff989cfd6edca0
    IP: load_misc_binary+0x22b/0x470 [binfmt_misc]
    PGD 1ef3e067 P4D 1ef3e067 PUD 0
    Oops: 0000 [#1] SMP NOPTI
    Modules linked in: binfmt_misc kvm_intel ppdev kvm irqbypass joydev input_leds serio_raw mac_hid parport_pc qemu_fw_cfg parpy
    CPU: 0 PID: 2499 Comm: bash Not tainted 4.15.0-22-generic #24-Ubuntu
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
    RIP: 0010:load_misc_binary+0x22b/0x470 [binfmt_misc]
    Call Trace:
    search_binary_handler+0x97/0x1d0
    do_execveat_common.isra.34+0x667/0x810
    SyS_execve+0x31/0x40
    do_syscall_64+0x73/0x130
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Use kstrtoint instead of simple_strtoul. It will work as the code
    already set the delimiter byte to '\0' and we only do it when the field
    is not empty.

    Tested with offsets -1, 2500000000, UINT_MAX and INT_MAX. Also tested
    with examples documented at Documentation/admin-guide/binfmt-misc.rst
    and other registrations from packages on Ubuntu.

    Link: http://lkml.kernel.org/r/20180529135648.14254-1-cascardo@canonical.com
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Thadeu Lima de Souza Cascardo
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thadeu Lima de Souza Cascardo
     

03 Apr, 2018

1 commit

  • Using the ksys_close() wrapper allows us to get rid of in-kernel calls
    to the sys_close() syscall. The ksys_ prefix denotes that this function
    is meant as a drop-in replacement for the syscall. In particular, it
    uses the same calling convention as sys_close(), with one subtle
    difference:

    The few places which checked the return value did not care about the return
    value re-writing in sys_close(), so simply use a wrapper around
    __close_fd().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

14 Oct, 2017

1 commit

  • inode->i_private is assigned by a Node pointer only after registering a
    new binary format, so it could be NULL if inode was created by
    bm_fill_super() (or iput() was called by the error path in
    bm_register_write()), and this could result in NULL pointer dereference
    when evicting such an inode. e.g. mount binfmt_misc filesystem then
    umount it immediately:

    mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
    umount /proc/sys/fs/binfmt_misc

    will result in

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000013
    IP: bm_evict_inode+0x16/0x40 [binfmt_misc]
    ...
    Call Trace:
    evict+0xd3/0x1a0
    iput+0x17d/0x1d0
    dentry_unlink_inode+0xb9/0xf0
    __dentry_kill+0xc7/0x170
    shrink_dentry_list+0x122/0x280
    shrink_dcache_parent+0x39/0x90
    do_one_tree+0x12/0x40
    shrink_dcache_for_umount+0x2d/0x90
    generic_shutdown_super+0x1f/0x120
    kill_litter_super+0x29/0x40
    deactivate_locked_super+0x43/0x70
    deactivate_super+0x45/0x60
    cleanup_mnt+0x3f/0x70
    __cleanup_mnt+0x12/0x20
    task_work_run+0x86/0xa0
    exit_to_usermode_loop+0x6d/0x99
    syscall_return_slowpath+0xba/0xf0
    entry_SYSCALL_64_fastpath+0xa3/0xa

    Fix it by making sure Node (e) is not NULL.

    Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com
    Fixes: 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
    Signed-off-by: Eryu Guan
    Acked-by: Oleg Nesterov
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eryu Guan
     

04 Oct, 2017

5 commits

  • After the previous change "fmt" can't go away, we can kill
    iname/iname_addr and use fmt->interpreter.

    Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • load_misc_binary() makes a local copy of fmt->interpreter under
    entries_lock to avoid the race with kill_node() but this is not enough;
    the whole Node can be freed after we drop entries_lock, not only the
    ->interpreter string.

    Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free
    this Node.

    Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc: Travis Gummels
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If MISC_FMT_OPEN_FILE flag is set e->interp_file must be valid or we
    have a bug which should not be silently ignored.

    Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • To ensure that load_misc_binary() can't use the partially destroyed
    Node, see also the next patch.

    The current logic looks wrong in any case, once we close interp_file it
    doesn't make any sense to delay kfree(inode->i_private), this Node is no
    longer valid. Even if the MISC_FMT_OPEN_FILE/interp_file checks were
    not racy (they are), load_misc_binary() should not try to reopen
    ->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL.

    And I can't understand why do we use filp_close(), not fput().

    Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • kill_node() nullifies/checks Node->dentry to avoid double free. This
    complicates the next changes and this is very confusing:

    - we do not need to check dentry != NULL under entries_lock,
    kill_node() is always called under inode_lock(d_inode(root)) and we
    rely on this inode_lock() anyway, without this lock the
    MISC_FMT_OPEN_FILE cleanup could race with itself.

    - if kill_inode() was already called and ->dentry == NULL we should not
    even try to close e->interp_file.

    We can change bm_entry_write() to simply check !list_empty(list) before
    kill_node. Again, we rely on inode_lock(), in particular it saves us
    from the race with bm_status_write(), another caller of kill_node().

    Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Sep, 2017

1 commit

  • Use proper ssize_t and size_t types for the return value and count
    argument, move the offset last and make it an in/out argument like
    all other read/write helpers, and make the buf argument a void pointer
    to get rid of lots of casts in the callers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

27 Apr, 2017

1 commit

  • simple_fill_super() is passed an array of tree_descr structures which
    describe the files to create in the filesystem's root directory. Since
    these arrays are never modified intentionally, they should be 'const' so
    that they are placed in .rodata and benefit from memory protection.
    This patch updates the function signature and all users, and also
    constifies tree_descr.name.

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

02 Mar, 2017

1 commit


28 Sep, 2016

1 commit

  • current_fs_time() uses struct super_block* as an argument.
    As per Linus's suggestion, this is changed to take struct
    inode* as a parameter instead. This is because the function
    is primarily meant for vfs inode timestamps.
    Also the function was renamed as per Arnd's suggestion.

    Change all calls to current_fs_time() to use the new
    current_time() function instead. current_fs_time() will be
    deleted.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Al Viro

    Deepa Dinamani
     

07 Aug, 2016

1 commit

  • Pull binfmt_misc update from James Bottomley:
    "This update is to allow architecture emulation containers to function
    such that the emulation binary can be housed outside the container
    itself. The container and fs parts both have acks from relevant
    experts.

    To use the new feature you have to add an F option to your binfmt_misc
    configuration"

    From the docs:
    "The usual behaviour of binfmt_misc is to spawn the binary lazily when
    the misc format file is invoked. However, this doesn't work very well
    in the face of mount namespaces and changeroots, so the F mode opens
    the binary as soon as the emulation is installed and uses the opened
    image to spawn the emulator, meaning it is always available once
    installed, regardless of how the environment changes"

    * tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc:
    binfmt_misc: add F option description to documentation
    binfmt_misc: add persistent opened binary handler for containers
    fs: add filp_clone_open API

    Linus Torvalds
     

30 May, 2016

1 commit


31 Mar, 2016

1 commit

  • This patch adds a new flag 'F' to the binfmt handlers. If you pass in
    'F' the binary that runs the emulation will be opened immediately and
    in future, will be cloned from the open file.

    The net effect is that the handler survives both changeroots and mount
    namespace changes, making it easy to work with foreign architecture
    containers without contaminating the container image with the
    emulator.

    Signed-off-by: James Bottomley
    Acked-by: Serge Hallyn

    James Bottomley
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

17 Apr, 2015

1 commit

  • sprintf() reliably returns the number of characters printed, so we don't
    need to ask strlen() where we are. Also replace calling sprintf("%02x")
    in a loop with the much simpler bin2hex().

    [akpm@linux-foundation.org: it's odd to include kernel.h after everything else]
    Signed-off-by: Rasmus Villemoes
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

16 Apr, 2015

1 commit


17 Dec, 2014

1 commit

  • scanarg(s, del) never returns s; the empty field results in s + 1.
    Restore the correct checks, and move NUL-termination into scanarg(),
    while we are at it.

    Incidentally, mixing "coding style cleanups" (for small values of cleanup)
    with functional changes is a Bad Idea(tm)...

    Signed-off-by: Al Viro

    Al Viro
     

14 Dec, 2014

1 commit

  • This patchset adds execveat(2) for x86, and is derived from Meredydd
    Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

    The primary aim of adding an execveat syscall is to allow an
    implementation of fexecve(3) that does not rely on the /proc filesystem,
    at least for executables (rather than scripts). The current glibc version
    of fexecve(3) is implemented via /proc, which causes problems in sandboxed
    or otherwise restricted environments.

    Given the desire for a /proc-free fexecve() implementation, HPA suggested
    (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
    an appropriate generalization.

    Also, having a new syscall means that it can take a flags argument without
    back-compatibility concerns. The current implementation just defines the
    AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
    added in future -- for example, flags for new namespaces (as suggested at
    https://lkml.org/lkml/2006/7/11/474).

    Related history:
    - https://lkml.org/lkml/2006/12/27/123 is an example of someone
    realizing that fexecve() is likely to fail in a chroot environment.
    - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
    documenting the /proc requirement of fexecve(3) in its manpage, to
    "prevent other people from wasting their time".
    - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
    problem where a process that did setuid() could not fexecve()
    because it no longer had access to /proc/self/fd; this has since
    been fixed.

    This patch (of 4):

    Add a new execveat(2) system call. execveat() is to execve() as openat()
    is to open(): it takes a file descriptor that refers to a directory, and
    resolves the filename relative to that.

    In addition, if the filename is empty and AT_EMPTY_PATH is specified,
    execveat() executes the file to which the file descriptor refers. This
    replicates the functionality of fexecve(), which is a system call in other
    UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/" (and
    so relies on /proc being mounted).

    The filename fed to the executed program as argv[0] (or the name of the
    script fed to a script interpreter) will be of the form "/dev/fd/"
    (for an empty filename) or "/dev/fd//", effectively
    reflecting how the executable was found. This does however mean that
    execution of a script in a /proc-less environment won't work; also, script
    execution via an O_CLOEXEC file descriptor fails (as the file will not be
    accessible after exec).

    Based on patches by Meredydd Luff.

    Signed-off-by: David Drysdale
    Cc: Meredydd Luff
    Cc: Shuah Khan
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Alexander Viro
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Kees Cook
    Cc: Arnd Bergmann
    Cc: Rich Felker
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Drysdale
     

11 Dec, 2014

4 commits

  • GFP_USER means "honour cpuset nodes-allowed beancounting". These are
    regular old kernel objects and there seems no reason to give them this
    treatment.

    Acked-by: Mike Frysinger
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Clean up various coding style issues that checkpatch complains about.
    No functional changes here.

    Signed-off-by: Mike Frysinger
    Cc: Al Viro
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • When trying to develop a custom format handler, the errors returned all
    effectively get bucketed as EINVAL with no kernel messages. The other
    errors (ENOMEM/EFAULT) are internal/obvious and basic. Thus any time a
    bad handler is rejected, the developer has to walk the dense code and
    try to guess where it went wrong. Needing to dive into kernel code is
    itself a fairly high barrier for a lot of people.

    To improve this situation, let's deploy extensive pr_debug markers at
    logical parse points, and add comments to the dense parsing logic. It
    let's you see exactly where the parsing aborts, the string the kernel
    received (useful when dealing with shell code), how it translated the
    buffers to binary data, and how it will apply the mask at runtime.

    Some example output:
    $ echo ':qemu-foo:M::\x7fELF\xAD\xAD\x01\x00:\xff\xff\xff\xff\xff\x00\xff\x00:/usr/bin/qemu-foo:POC' > register
    $ dmesg
    binfmt_misc: register: received 92 bytes
    binfmt_misc: register: delim: 0x3a {:}
    binfmt_misc: register: name: {qemu-foo}
    binfmt_misc: register: type: M (magic)
    binfmt_misc: register: offset: 0x0
    binfmt_misc: register: magic[raw]: 5c 78 37 66 45 4c 46 5c 78 41 44 5c 78 41 44 5c \x7fELF\xAD\xAD\
    binfmt_misc: register: magic[raw]: 78 30 31 5c 78 30 30 00 x01\x00.
    binfmt_misc: register: mask[raw]: 5c 78 66 66 5c 78 66 66 5c 78 66 66 5c 78 66 66 \xff\xff\xff\xff
    binfmt_misc: register: mask[raw]: 5c 78 66 66 5c 78 30 30 5c 78 66 66 5c 78 30 30 \xff\x00\xff\x00
    binfmt_misc: register: mask[raw]: 00 .
    binfmt_misc: register: magic/mask length: 8
    binfmt_misc: register: magic[decoded]: 7f 45 4c 46 ad ad 01 00 .ELF....
    binfmt_misc: register: mask[decoded]: ff ff ff ff ff 00 ff 00 ........
    binfmt_misc: register: magic[masked]: 7f 45 4c 46 ad 00 01 00 .ELF....
    binfmt_misc: register: interpreter: {/usr/bin/qemu-foo}
    binfmt_misc: register: flag: P (preserve argv0)
    binfmt_misc: register: flag: O (open binary)
    binfmt_misc: register: flag: C (preserve creds)

    The [raw] lines show us exactly what was received from userspace. The
    lines after that show us how the kernel has decoded things.

    Signed-off-by: Mike Frysinger
    Cc: Al Viro
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • This patch replaces calls to get_unused_fd() with equivalent call to
    get_unused_fd_flags(0) to preserve current behavor for existing code.

    In a further patch, get_unused_fd() will be removed so that new code start
    using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
    by default or choosen by userspace.

    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     

14 Oct, 2014

2 commits

  • gcc-4.9 on ARM gives us a mysterious warning about the binfmt_misc
    parse_command function:

    fs/binfmt_misc.c: In function 'parse_command.part.3':
    fs/binfmt_misc.c:405:7: warning: array subscript is above array bounds [-Warray-bounds]

    I've managed to trace this back to the ARM implementation of memset,
    which is called from copy_from_user in case of a fault and which does

    #define memset(p,v,n) \
    ({ \
    void *__p = (p); size_t __n = n; \
    if ((__n) != 0) { \
    if (__builtin_constant_p((v)) && (v) == 0) \
    __memzero((__p),(__n)); \
    else \
    memset((__p),(v),(__n)); \
    } \
    (__p); \
    })

    Apparently gcc gets confused by the check for "size != 0" and believes
    that the size might be zero when it gets to the line that does "if
    (s[count-1] == '\n')", so it would access data outside of the array.

    gcc is clearly wrong here, since this condition was already checked
    earlier in the function and the 'size' value can not change in the
    meantime.

    Fortunately, we can work around it and get rid of the warning by
    rearranging the function to check for zero size after doing the
    copy_from_user. It is still safe to pass a zero size into
    copy_from_user, so it does not cause any side effects.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • The current code places a 256 byte limit on the registration format.
    This ends up being fairly limited when you try to do matching against a
    binary format like ELF:

    - the magic & mask formats cannot have any embedded NUL chars
    (string_unescape_inplace halts at the first NUL)
    - each escape sequence quadruples the size: \x00 is needed for NUL
    - trying to match bytes at the start of the file as well as further
    on leads to a lot of \x00 sequences in the mask
    - magic & mask have to be the same length (when decoded)
    - still need bytes for the other fields
    - impossible!

    Let's look at a concrete (and common) example: using QEMU to run MIPS
    ELFs. The name field uses 11 bytes "qemu-mipsel". The interp uses 20
    bytes "/usr/bin/qemu-mipsel". The type & flags takes up 4 bytes. We
    need 7 bytes for the delimiter (usually ":"). We can skip offset. So
    already we're down to 107 bytes to use with the magic/mask instead of
    the real limit of 128 (BINPRM_BUF_SIZE). If people use shell code to
    register (which they do the majority of the time), they're down to ~26
    possible bytes since the escape sequence must be \x##.

    The ELF format looks like (both 32 & 64 bit):

    e_ident: 16 bytes
    e_type: 2 bytes
    e_machine: 2 bytes

    Those 20 bytes are enough for most architectures because they have so few
    formats in the first place, thus they can be uniquely identified. That
    also means for shell users, since 20 is smaller than 26, they can sanely
    register a handler.

    But for some targets (like MIPS), we need to poke further. The ELF fields
    continue on:

    e_entry: 4 or 8 bytes
    e_phoff: 4 or 8 bytes
    e_shoff: 4 or 8 bytes
    e_flags: 4 bytes

    We only care about e_flags here as that includes the bits to identify
    whether the ELF is O32/N32/N64. But now we have to consume another 16
    bytes (for 32 bit ELFs) or 28 bytes (for 64 bit ELFs) just to match the
    flags. If every byte is escaped, we send 288 more bytes to the kernel
    ((20 {e_ident,e_type,e_machine} + 12 {e_entry,e_phoff,e_shoff} + 4
    {e_flags}) * 2 {mask,magic} * 4 {escape}) and we've clearly blown our
    budget.

    Even if we try to be clever and do the decoding ourselves (rather than
    relying on the kernel to process \x##), we still can't hit the mark --
    string_unescape_inplace treats mask & magic as C strings so NUL cannot
    be embedded. That leaves us with having to pass \x00 for the 12/24
    entry/phoff/shoff bytes (as those will be completely random addresses),
    and that is a minimum requirement of 48/96 bytes for the mask alone.
    Add up the rest and we blow through it (this is for 64 bit ELFs):
    magic: 20 {e_ident,e_type,e_machine} + 24 {e_entry,e_phoff,e_shoff} +
    4 {e_flags} = 48 # ^^ See note below.
    mask: 20 {e_ident,e_type,e_machine} + 96 {e_entry,e_phoff,e_shoff} +
    4 {e_flags} = 120
    Remember above we had 107 left over, and now we're at 168. This is of
    course the *best* case scenario -- you'll also want to have NUL bytes
    in the magic & mask too to match literal zeros.

    Note: the reason we can use 24 in the magic is that we can work off of the
    fact that for bytes the mask would clobber, we can stuff any value into
    magic that we want. So when mask is \x00, we don't need the magic to also
    be \x00, it can be an unescaped raw byte like '!'. This lets us handle
    more formats (barely) under the current 256 limit, but that's a pretty
    tall hoop to force people to jump through.

    With all that said, let's bump the limit from 256 bytes to 1920. This way
    we support escaping every byte of the mask & magic field (which is 1024
    bytes by themselves -- 128 * 4 * 2), and we leave plenty of room for other
    fields. Like long paths to the interpreter (when you have source in your
    /really/long/homedir/qemu/foo). Since the current code stuffs more than
    one structure into the same buffer, we leave a bit of space to easily
    round up to 2k. 1920 is just as arbitrary as 256 ;).

    Signed-off-by: Mike Frysinger
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     

04 Apr, 2014

1 commit