04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives. Allowing simple, safe,
    well understood work-arounds to known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as it's module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases. The most significant being autofs that lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regards to the users permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep.
    Which means there is not much attack surface exposed by loading a
    filesytem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Feb, 2013

1 commit


21 Dec, 2012

1 commit


03 Oct, 2012

2 commits

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • There's no reason to call rcu_barrier() on every
    deactivate_locked_super(). We only need to make sure that all delayed rcu
    free inodes are flushed before we destroy related cache.

    Removing rcu_barrier() from deactivate_locked_super() affects some fast
    paths. E.g. on my machine exit_group() of a last process in IPC
    namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

    Signed-off-by: Kirill A. Shutemov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kirill A. Shutemov
     

21 Sep, 2012

1 commit


31 Jul, 2012

1 commit

  • On minix2 and minix3 usually max_size is 7fffffff and the check in
    question prohibits creation of last block spanning right before 7fffffff,
    due to downward rounding during the division. Fix it by using
    multiplication instead.

    [akpm@linux-foundation.org: fix up code layout, use local `sb']
    Signed-off-by: Vladimir Serbinenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Serbinenko
     

14 Jul, 2012

2 commits

  • boolean "does it have to be exclusive?" flag is passed instead;
    Local filesystem should just ignore it - the object is guaranteed
    not to be there yet.

    Signed-off-by: Al Viro

    Al Viro
     
  • Just the flags; only NFS cares even about that, but there are
    legitimate uses for such argument. And getting rid of that
    completely would require splitting ->lookup() into a couple
    of methods (at least), so let's leave that alone for now...

    Signed-off-by: Al Viro

    Al Viro
     

06 May, 2012

1 commit

  • After we moved inode_sync_wait() from end_writeback() it doesn't make sense
    to call the function end_writeback() anymore. Rename it to clear_inode()
    which well says what the function really does - set I_CLEAR flag.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

22 Mar, 2012

1 commit

  • Pull vfs pile 1 from Al Viro:
    "This is _not_ all; in particular, Miklos' and Jan's stuff is not there
    yet."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
    ext4: initialization of ext4_li_mtx needs to be done earlier
    debugfs-related mode_t whack-a-mole
    hfsplus: add an ioctl to bless files
    hfsplus: change finder_info to u32
    hfsplus: initialise userflags
    qnx4: new helper - try_extent()
    qnx4: get rid of qnx4_bread/qnx4_getblk
    take removal of PF_FORKNOEXEC to flush_old_exec()
    trim includes in inode.c
    um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
    um: embed ->stub_pages[] into mmu_context
    gadgetfs: list_for_each_safe() misuse
    ocfs2: fix leaks on failure exits in module_init
    ecryptfs: make register_filesystem() the last potential failure exit
    ntfs: forgets to unregister sysctls on register_filesystem() failure
    logfs: missing cleanup on register_filesystem() failure
    jfs: mising cleanup on register_filesystem() failure
    make configfs_pin_fs() return root dentry on success
    configfs: configfs_create_dir() has parent dentry in dentry->d_parent
    configfs: sanitize configfs_create()
    ...

    Linus Torvalds
     

21 Mar, 2012

2 commits


20 Mar, 2012

1 commit


09 Jan, 2012

1 commit

  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
    reiserfs: Properly display mount options in /proc/mounts
    vfs: prevent remount read-only if pending removes
    vfs: count unlinked inodes
    vfs: protect remounting superblock read-only
    vfs: keep list of mounts for each superblock
    vfs: switch ->show_options() to struct dentry *
    vfs: switch ->show_path() to struct dentry *
    vfs: switch ->show_devname() to struct dentry *
    vfs: switch ->show_stats to struct dentry *
    switch security_path_chmod() to struct path *
    vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
    vfs: trim includes a bit
    switch mnt_namespace ->root to struct mount
    vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
    vfs: opencode mntget() mnt_set_mountpoint()
    vfs: spread struct mount - remaining argument of next_mnt()
    vfs: move fsnotify junk to struct mount
    vfs: move mnt_devname
    vfs: move mnt_list to struct mount
    vfs: switch pnode.h macros to struct mount *
    ...

    Linus Torvalds
     

05 Jan, 2012

1 commit

  • bitmap size sanity checks should be done *before* allocating ->s_root;
    there their cleanup on failure would be correct. As it is, we do iput()
    on root inode, but leak the root dentry...

    Signed-off-by: Al Viro
    Acked-by: Josh Boyer
    Signed-off-by: Linus Torvalds

    Al Viro
     

04 Jan, 2012

5 commits


20 Nov, 2011

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Newer versions of MINIX can create filesystems that allocate an extra
    bitmap block. Mounting of this succeeds, but doing a statfs call will
    result in an oops in count_free because of a negative number being used
    for the bh index.

    Avoid this by verifying the number of allocated blocks at mount time,
    erroring out if there are not enough and make statfs ignore the extras
    if there are too many.

    This fixes https://bugzilla.kernel.org/show_bug.cgi?id=18792

    Signed-off-by: Josh Boyer
    Signed-off-by: Al Viro

    Josh Boyer
     

02 Nov, 2011

1 commit


21 Jul, 2011

1 commit


28 May, 2011

1 commit


26 May, 2011

2 commits


25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

24 Mar, 2011

1 commit

  • minix bit operations are only used by minix filesystem and useless by
    other modules. Because byte order of inode and block bitmaps is different
    on each architecture like below:

    m68k:
    big-endian 16bit indexed bitmaps

    h8300, microblaze, s390, sparc, m68knommu:
    big-endian 32 or 64bit indexed bitmaps

    m32r, mips, sh, xtensa:
    big-endian 32 or 64bit indexed bitmaps for big-endian mode
    little-endian bitmaps for little-endian mode

    Others:
    little-endian bitmaps

    In order to move minix bit operations from asm/bitops.h to architecture
    independent code in minix filesystem, this provides two config options.

    CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k.
    CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use
    native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu,
    m32r, mips, sh, xtensa). The architectures which always use little-endian
    bitmaps do not select these options.

    Finally, we can remove minix bit operations from asm/bitops.h for all
    architectures.

    Signed-off-by: Akinobu Mita
    Acked-by: Arnd Bergmann
    Acked-by: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Andreas Schwab
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Michal Simek
    Cc: "David S. Miller"
    Cc: Hirokazu Takata
    Acked-by: Ralf Baechle
    Acked-by: Paul Mundt
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Mar, 2011

1 commit


13 Jan, 2011

1 commit


07 Jan, 2011

2 commits

  • Reduce some branches and memory accesses in dcache lookup by adding dentry
    flags to indicate common d_ops are set, rather than having to check them.
    This saves a pointer memory access (dentry->d_op) in common path lookup
    situations, and saves another pointer load and branch in cases where we
    have d_op but not the particular operation.

    Patched with:

    git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downsides of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

29 Oct, 2010

1 commit


26 Oct, 2010

1 commit


10 Sep, 2010

1 commit

  • Commit 9eed1fb721c ("minix: replace inode uid,gid,mode init with helper")
    broke directory creation on minix filesystems.

    Fix it by passing the needed mode flag to inode init helper.

    Signed-off-by: Jorge Boncompte [DTI2]
    Cc: Dmitry Monakhov
    Cc: Al Viro
    Cc: [2.6.35.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jorge Boncompte [DTI2]
     

10 Aug, 2010

2 commits

  • We need to wait for completion of possible writeback in progress
    before we clear on-disk inode during deletion.

    Signed-off-by: Al Viro

    Al Viro
     
  • Replace inode_setattr with opencoded variants of it in all callers. This
    moves the remaining call to vmtruncate into the filesystem methods where it
    can be replaced with the proper truncate sequence.

    In a few cases it was obvious that we would never end up calling vmtruncate
    so it was left out in the opencoded variant:

    spufs: explicitly checks for ATTR_SIZE earlier
    btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
    ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

    In addition to that ncpfs called inode_setattr with handcrafted iattrs,
    which allowed to trim down the opencoded variant.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig