18 Jul, 2015

1 commit

  • Some modules call config_item_init_type_name() and config_group_init_type_name()
    with parameter "name" directly controlled by userspace. These two
    functions call config_item_set_name() with this name used as a format
    string, which can be used to leak information such as content of the
    stack to userspace.

    For example, make_netconsole_target() in netconsole module calls
    config_item_init_type_name() with the name of a newly-created directory.
    This means that the following commands give some unexpected output, with
    configfs mounted in /sys/kernel/config/ and on a system with a
    configured eth0 ethernet interface:

    # modprobe netconsole
    # mkdir /sys/kernel/config/netconsole/target_%lx
    # echo eth0 > /sys/kernel/config/netconsole/target_%lx/dev_name
    # echo 1 > /sys/kernel/config/netconsole/target_%lx/enabled
    # echo eth0 > /sys/kernel/config/netconsole/target_%lx/dev_name
    # dmesg |tail -n1
    [ 142.697668] netconsole: target (target_ffffffffc0ae8080) is
    enabled, disable to update parameters

    The directory name is correct but %lx has been interpreted in the
    internal item name, displayed here in the error message used by
    store_dev_name() in drivers/net/netconsole.c.

    To fix this, update every caller of config_item_set_name to use "%s"
    when operating on untrusted input.

    This issue was found using -Wformat-security gcc flag, once a __printf
    attribute has been added to config_item_set_name().

    Signed-off-by: Nicolas Iooss
    Acked-by: Greg Kroah-Hartman
    Acked-by: Felipe Balbi
    Acked-by: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

04 Jul, 2015

1 commit

  • Pull user namespace updates from Eric Biederman:
    "Long ago and far away when user namespaces where young it was realized
    that allowing fresh mounts of proc and sysfs with only user namespace
    permissions could violate the basic rule that only root gets to decide
    if proc or sysfs should be mounted at all.

    Some hacks were put in place to reduce the worst of the damage could
    be done, and the common sense rule was adopted that fresh mounts of
    proc and sysfs should allow no more than bind mounts of proc and
    sysfs. Unfortunately that rule has not been fully enforced.

    There are two kinds of gaps in that enforcement. Only filesystems
    mounted on empty directories of proc and sysfs should be ignored but
    the test for empty directories was insufficient. So in my tree
    directories on proc, sysctl and sysfs that will always be empty are
    created specially. Every other technique is imperfect as an ordinary
    directory can have entries added even after a readdir returns and
    shows that the directory is empty. Special creation of directories
    for mount points makes the code in the kernel a smidge clearer about
    it's purpose. I asked container developers from the various container
    projects to help test this and no holes were found in the set of mount
    points on proc and sysfs that are created specially.

    This set of changes also starts enforcing the mount flags of fresh
    mounts of proc and sysfs are consistent with the existing mount of
    proc and sysfs. I expected this to be the boring part of the work but
    unfortunately unprivileged userspace winds up mounting fresh copies of
    proc and sysfs with noexec and nosuid clear when root set those flags
    on the previous mount of proc and sysfs. So for now only the atime,
    read-only and nodev attributes which userspace happens to keep
    consistent are enforced. Dealing with the noexec and nosuid
    attributes remains for another time.

    This set of changes also addresses an issue with how open file
    descriptors from /proc//ns/* are displayed. Recently readlink of
    /proc//fd has been triggering a WARN_ON that has not been
    meaningful since it was added (as all of the code in the kernel was
    converted) and is not now actively wrong.

    There is also a short list of issues that have not been fixed yet that
    I will mention briefly.

    It is possible to rename a directory from below to above a bind mount.
    At which point any directory pointers below the renamed directory can
    be walked up to the root directory of the filesystem. With user
    namespaces enabled a bind mount of the bind mount can be created
    allowing the user to pick a directory whose children they can rename
    to outside of the bind mount. This is challenging to fix and doubly
    so because all obvious solutions must touch code that is in the
    performance part of pathname resolution.

    As mentioned above there is also a question of how to ensure that
    developers by accident or with purpose do not introduce exectuable
    files on sysfs and proc and in doing so introduce security regressions
    in the current userspace that will not be immediately obvious and as
    such are likely to require breaking userspace in painful ways once
    they are recognized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Remove incorrect debugging WARN in prepend_path
    mnt: Update fs_fully_visible to test for permanently empty directories
    sysfs: Create mountpoints with sysfs_create_mount_point
    sysfs: Add support for permanently empty directories to serve as mount points.
    kernfs: Add support for always empty directories.
    proc: Allow creating permanently empty directories that serve as mount points
    sysctl: Allow creating permanently empty directories that serve as mountpoints.
    fs: Add helper functions for permanently empty directories.
    vfs: Ignore unlocked mounts in fs_fully_visible
    mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
    mnt: Refactor the logic for mounting sysfs and proc in a user namespace

    Linus Torvalds
     

01 Jul, 2015

1 commit

  • This allows for better documentation in the code and
    it allows for a simpler and fully correct version of
    fs_fully_visible to be written.

    The mount points converted and their filesystems are:
    /sys/hypervisor/s390/ s390_hypfs
    /sys/kernel/config/ configfs
    /sys/kernel/debug/ debugfs
    /sys/firmware/efi/efivars/ efivarfs
    /sys/fs/fuse/connections/ fusectl
    /sys/fs/pstore/ pstore
    /sys/kernel/tracing/ tracefs
    /sys/fs/cgroup/ cgroup
    /sys/kernel/security/ securityfs
    /sys/fs/selinux/ selinuxfs
    /sys/fs/smackfs/ smackfs

    Cc: stable@vger.kernel.org
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

25 Jun, 2015

1 commit


24 Jun, 2015

1 commit


11 May, 2015

4 commits

  • similar to kfree_put_link()

    Signed-off-by: Al Viro

    Al Viro
     
  • only one instance looks at that argument at all; that sole
    exception wants inode rather than dentry.

    Signed-off-by: Al Viro

    Al Viro
     
  • its only use is getting passed to nd_jump_link(), which can obtain
    it from current->nameidata

    Signed-off-by: Al Viro

    Al Viro
     
  • a) instead of storing the symlink body (via nd_set_link()) and returning
    an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
    that opaque pointer (into void * passed by address by caller) and returns
    the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic
    symlinks) and pointer to symlink body for normal symlinks. Stored pointer
    is ignored in all cases except the last one.

    Storing NULL for opaque pointer (or not storing it at all) means no call
    of ->put_link().

    b) the body used to be passed to ->put_link() implicitly (via nameidata).
    Now only the opaque pointer is. In the cases when we used the symlink body
    to free stuff, ->follow_link() now should store it as opaque pointer in addition
    to returning it.

    Signed-off-by: Al Viro

    Al Viro
     

06 May, 2015

1 commit

  • We need this earlier in the boot process to allow various subsystems to
    use configfs (e.g Industrial IIO).

    Also, debugfs is at core_initcall level and configfs should be on the same
    level from infrastructure point of view.

    Signed-off-by: Daniel Baluta
    Suggested-by: Lars-Peter Clausen
    Reviewed-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Baluta
     

16 Apr, 2015

2 commits


20 Feb, 2015

1 commit

  • Code that does this:

    if (!(d_unhashed(dentry) && dentry->d_inode)) {
    ...
    simple_unlink(parent->d_inode, dentry);
    }

    is broken because:

    !(d_unhashed(dentry) && dentry->d_inode)

    is equivalent to:

    !d_unhashed(dentry) || !dentry->d_inode

    so it is possible to get into simple_unlink() with dentry->d_inode == NULL.

    simple_unlink(), however, assumes dentry->d_inode cannot be NULL.

    I think that what was meant is this:

    !d_unhashed(dentry) && dentry->d_inode

    and that the logical-not operator or the final close-bracket was misplaced.

    Signed-off-by: David Howells
    cc: Joel Becker
    Signed-off-by: Al Viro

    David Howells
     

18 Feb, 2015

3 commits


21 Jan, 2015

2 commits

  • Now that we never use the backing_dev_info pointer in struct address_space
    we can simply remove it and save 4 to 8 bytes in every inode.

    Signed-off-by: Christoph Hellwig
    Acked-by: Ryusuke Konishi
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Since "BDI: Provide backing device capability information [try #3]" the
    backing_dev_info structure also provides flags for the kind of mmap
    operation available in a nommu environment, which is entirely unrelated
    to it's original purpose.

    Introduce a new nommu-only file operation to provide this information to
    the nommu mmap code instead. Splitting this from the backing_dev_info
    structure allows to remove lots of backing_dev_info instance that aren't
    otherwise needed, and entirely gets rid of the concept of providing a
    backing_dev_info for a character device. It also removes the need for
    the mtd_inodefs filesystem.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Acked-by: Brian Norris
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

20 Nov, 2014

1 commit


05 Jun, 2014

3 commits


22 Nov, 2013

1 commit

  • A race window in configfs, it starts from one dentry is UNHASHED and end
    before configfs_d_iput is called. In this window, if a lookup happen,
    since the original dentry was UNHASHED, so a new dentry will be
    allocated, and then in configfs_attach_attr(), sd->s_dentry will be
    updated to the new dentry. Then in configfs_d_iput(),
    BUG_ON(sd->s_dentry != dentry) will be triggered and system panic.

    sys_open: sys_close:
    ... fput
    dput
    dentry_kill
    __d_drop dentry still point
    to this dentry.

    lookup_real
    configfs_lookup
    configfs_attach_attr---> update sd->s_dentry
    to new allocated dentry here.

    d_kill
    configfs_d_iput s_dentry != dentry)
    triggered here.

    To fix it, change configfs_d_iput to not update sd->s_dentry if
    sd->s_count > 2, that means there are another dentry is using the sd
    beside the one that is going to be put. Use configfs_dirent_lock in
    configfs_attach_attr to sync with configfs_d_iput.

    With the following steps, you can reproduce the bug.

    1. enable ocfs2, this will mount configfs at /sys/kernel/config and
    fill configure in it.

    2. run the following script.
    while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &
    while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &

    Signed-off-by: Junxiao Bi
    Cc: Joel Becker
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     

16 Nov, 2013

1 commit


15 Jul, 2013

1 commit

  • Pull more vfs stuff from Al Viro:
    "O_TMPFILE ABI changes, Oleg's fput() series, misc cleanups, including
    making simple_lookup() usable for filesystems with non-NULL s_d_op,
    which allows us to get rid of quite a bit of ugliness"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    sunrpc: now we can just set ->s_d_op
    cgroup: we can use simple_lookup() now
    efivarfs: we can use simple_lookup() now
    make simple_lookup() usable for filesystems that set ->s_d_op
    configfs: don't open-code d_alloc_name()
    __rpc_lookup_create_exclusive: pass string instead of qstr
    rpc_create_*_dir: don't bother with qstr
    llist: llist_add() can use llist_add_batch()
    llist: fix/simplify llist_add() and llist_add_batch()
    fput: turn "list_head delayed_fput_list" into llist_head
    fs/file_table.c:fput(): add comment
    Safer ABI for O_TMPFILE

    Linus Torvalds
     

14 Jul, 2013

1 commit


10 Jul, 2013

1 commit

  • Pull third set of VFS updates from Al Viro:
    "Misc stuff all over the place. There will be one more pile in a
    couple of days"

    This is an "evil merge" that also uses the new d_count helper in
    fs/configfs/dir.c, missed by commit 84d08fa888e7 ("helper for reading
    ->d_count")

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ncpfs: fix error return code in ncp_parse_options()
    locks: move file_lock_list to a set of percpu hlist_heads and convert file_lock_lock to an lglock
    seq_file: add seq_list_*_percpu helpers
    f2fs: fix readdir incorrectness
    mode_t whack-a-mole...
    lustre: kill the pointless wrapper
    helper for reading ->d_count

    Linus Torvalds
     

04 Jul, 2013

1 commit

  • The difference between "count" and "len" is that "len" is capped at
    4095. Changing it like this makes it match how sysfs_write_file() is
    implemented.

    This is a static analysis patch. I haven't found any store_attribute()
    functions where this change makes a difference.

    Signed-off-by: Dan Carpenter
    Acked-by: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

29 Jun, 2013

1 commit


04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives. Allowing simple, safe,
    well understood work-arounds to known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as it's module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases. The most significant being autofs that lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regards to the users permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep.
    Which means there is not much attack surface exposed by loading a
    filesytem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

23 Feb, 2013

1 commit


22 Feb, 2013

1 commit


18 Dec, 2012

1 commit


18 Sep, 2012

1 commit


14 Jul, 2012

1 commit

  • Just the flags; only NFS cares even about that, but there are
    legitimate uses for such argument. And getting rid of that
    completely would require splitting ->lookup() into a couple
    of methods (at least), so let's leave that alone for now...

    Signed-off-by: Al Viro

    Al Viro
     

21 Mar, 2012

3 commits