19 Oct, 2007

7 commits

  • If the ATTR_KILL_S*ID bits are set then any mode change is only for clearing
    the setuid/setgid bits. For CIFS, skip the mode change and let the server
    handle it.

    Signed-off-by: Jeff Layton
    Cc: Steven French
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • If the ATTR_KILL_S*ID bits are set then any mode change is only for clearing
    the setuid/setgid bits. For NFS, skip the mode change and let the server
    handle it.

    Signed-off-by: Jeff Layton
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • When an unprivileged process attempts to modify a file that has the setuid or
    setgid bits set, the VFS will attempt to clear these bits. The VFS will set
    the ATTR_KILL_SUID or ATTR_KILL_SGID bits in the ia_valid mask, and then call
    notify_change to clear these bits and set the mode accordingly.

    With a networked filesystem (NFS and CIFS in particular but likely others),
    the client machine or process may not have credentials that allow for setting
    the mode. In some situations, this can lead to file corruption, an operation
    failing outright because the setattr fails, or to races that lead to a mode
    change being reverted.

    In this situation, we'd like to just leave the handling of this to the server
    and ignore these bits. The problem is that by the time the setattr op is
    called, the VFS has already reinterpreted the ATTR_KILL_* bits into a mode
    change. The setattr operation has no way to know its intent.

    The following patch fixes this by making notify_change no longer clear the
    ATTR_KILL_SUID and ATTR_KILL_SGID bits in the ia_valid before handing it off
    to the setattr inode op. setattr can then check for the presence of these
    bits, and if they're set it can assume that the mode change was only for the
    purposes of clearing these bits.

    This means that we now have an implicit assumption that notify_change is never
    called with ATTR_MODE and either ATTR_KILL_S*ID bit set. Nothing currently
    enforces that, so this patch also adds a BUG() if that occurs.
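
    A minimal user-space sketch of the scheme described above, not the kernel
    code itself. The struct, the model_* helpers and the MODEL_* values are
    hypothetical stand-ins chosen to resemble the kernel's ATTR_* bits, and the
    model clears setgid unconditionally, which is a simplification. It shows
    notify_change leaving the kill bits set, so that a network filesystem's
    setattr can recognize the mode change as a setuid/setgid clear and drop it:

```c
#include <stdbool.h>

/* Assumed flag values for this model only. */
#define MODEL_ATTR_MODE      (1u << 0)
#define MODEL_ATTR_KILL_SUID (1u << 11)
#define MODEL_ATTR_KILL_SGID (1u << 12)
#define MODEL_ATTR_KILL      (MODEL_ATTR_KILL_SUID | MODEL_ATTR_KILL_SGID)

#define MODEL_S_ISUID 04000u
#define MODEL_S_ISGID 02000u

/* Hypothetical, cut-down stand-in for struct iattr. */
struct model_iattr {
    unsigned int ia_valid;
    unsigned int ia_mode;
};

/*
 * Model of notify_change: turn the kill bits into a mode change, but keep
 * them in ia_valid so the filesystem can still see the caller's intent.
 * Returns false for the combination on which the patch calls BUG().
 */
bool model_notify_change(unsigned int cur_mode, struct model_iattr *attr)
{
    unsigned int mode = cur_mode;

    if ((attr->ia_valid & MODEL_ATTR_MODE) &&
        (attr->ia_valid & MODEL_ATTR_KILL))
        return false;

    if (attr->ia_valid & MODEL_ATTR_KILL_SUID)
        mode &= ~MODEL_S_ISUID;
    if (attr->ia_valid & MODEL_ATTR_KILL_SGID)
        mode &= ~MODEL_S_ISGID;

    if (mode != cur_mode) {
        attr->ia_valid |= MODEL_ATTR_MODE;
        attr->ia_mode = mode;
    }
    return true;
}

/*
 * Model of a network filesystem's setattr: if a kill bit is present, the
 * mode change exists only to clear setuid/setgid, so skip it and leave
 * that job to the server.
 */
void model_netfs_setattr(struct model_iattr *attr)
{
    if (attr->ia_valid & MODEL_ATTR_KILL)
        attr->ia_valid &= ~MODEL_ATTR_MODE;
}
```

    A local filesystem in this model would simply apply ia_mode and ignore the
    kill bits.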

    Signed-off-by: Jeff Layton
    Cc: Michael Halcrow
    Cc: Christoph Hellwig
    Cc: Neil Brown
    Cc: "J. Bruce Fields"
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Cc: Josef 'Jeff' Sipek
    Cc: Trond Myklebust
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • reiserfs_setattr can call notify_change recursively using the same
    iattr struct. This could cause it to trip the BUG() in notify_change.
    Fix reiserfs to clear those bits near the beginning of the function.

    Signed-off-by: Jeff Layton
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • It's theoretically possible for a single SETATTR call to come in that sets the
    mode and the uid/gid. In that case, don't set the ATTR_KILL_S*ID bits since
    that would trip the BUG() in notify_change. Just fix up the mode to have the
    same effect.

    Signed-off-by: Jeff Layton
    Cc: Neil Brown
    Cc: "J. Bruce Fields"
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • Make sure ecryptfs doesn't trip the BUG() in notify_change. This also allows
    the lower filesystem to interpret ATTR_KILL_S*ID in its own way.

    Signed-off-by: Jeff Layton
    Cc: Michael Halcrow
    Cc: Christoph Hellwig
    Cc: Neil Brown
    Cc: "J. Bruce Fields"
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: "Vladimir V. Saveliev"
    Cc: Josef 'Jeff' Sipek
    Cc: Trond Myklebust
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
  • Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b
    during 2.6.9 development. That commit introduced the io_wait field, which
    was write-only then and still remains write-only.

    Also garbage collect macros which "use" io_wait.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

18 Oct, 2007

29 commits

  • When resizing online, setup_new_group_blocks attempts to reserve a
    potentially very large transaction, depending on the current filesystem
    geometry. For some journal sizes, there may not be enough room for this
    transaction, and the online resize will fail.

    The patch below resizes & restarts the transaction as necessary while
    setting up the new group, and should work with even the smallest journal.

    Tested with something like:

    [root@newbox ~]# dd if=/dev/zero of=fsfile bs=1024 count=32768
    [root@newbox ~]# mkfs.ext3 -b 1024 fsfile 16384
    [root@newbox ~]# mount -o loop fsfile mnt/
    [root@newbox ~]# resize2fs /dev/loop0
    resize2fs 1.40.2 (12-Jul-2007)
    Filesystem at /dev/loop0 is mounted on /root/mnt; on-line resizing required
    old desc_blocks = 1, new_desc_blocks = 1
    Performing an on-line resize of /dev/loop0 to 32768 (1k) blocks.
    resize2fs: No space left on device While trying to add group #2
    [root@newbox ~]# dmesg | tail -n 1
    JBD: resize2fs wants too many credits (258 > 256)
    [root@newbox ~]#

    With the below change, it works.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Mingming Cao
    Acked-by: Andreas Dilger

    Eric Sandeen
     
  • setup_new_group_blocks() manipulates the group descriptor block bh
    under the block_bitmap bh's lock. It shouldn't matter since nobody
    but resize should be touching these blocks, but it's worth fixing up.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Mingming Cao

    Eric Sandeen
     
  • Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • Convert ext4_extent_idx.ei_leaf to ext4_extent_idx.ei_leaf_lo
    This helps in finding BUGs due to direct partial access of
    these split 48 bit values.

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • Convert ext4_extent.ee_start to ext4_extent.ee_start_lo
    This helps in finding BUGs due to direct partial access of
    these split 48 bit values

    Also fix direct partial access in ext4 code
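
    A small user-space sketch of why the rename helps: once the low half is
    called ee_start_lo, any code that reads it alone is visibly touching only
    part of a 48 bit value, and full accesses go through helpers. The struct,
    the helper names and the ee_start_hi field name are assumptions made for
    this illustration, and byte order conversion is left out:

```c
#include <stdint.h>

/* Hypothetical, cut-down extent: a 48 bit start block split in two. */
struct model_extent {
    uint32_t ee_start_lo; /* low 32 bits of the physical block */
    uint16_t ee_start_hi; /* high 16 bits (field name assumed) */
};

/* Combine both halves; reading ee_start_lo alone would lose the top bits. */
uint64_t model_ext_pblock(const struct model_extent *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}

/* Store a block number into both halves. */
void model_ext_store_pblock(struct model_extent *ex, uint64_t pblock)
{
    ex->ee_start_lo = (uint32_t)(pblock & 0xffffffffu);
    ex->ee_start_hi = (uint16_t)((pblock >> 32) & 0xffffu);
}
```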

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • Convert s_r_blocks_count and s_free_blocks_count to
    s_r_blocks_count_lo and s_free_blocks_count_lo

    This helps in finding BUGs due to direct partial access of
    these split 64 bit values

    Also fix direct partial access in ext4 code

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • Convert s_blocks_count to s_blocks_count_lo
    This helps in finding BUGs due to direct partial access of
    these split 64 bit values

    Also fix direct partial access in ext4 code

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • Convert bg_inode_bitmap and bg_inode_table to bg_inode_bitmap_lo
    and bg_inode_table_lo. This helps in finding BUGs due to
    direct partial access of these split 64 bit values

    Also fix one direct partial access

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • Convert bg_block_bitmap to bg_block_bitmap_lo
    This helps in catching some BUGS due to direct
    partial access of these split fields.

    Signed-off-by: Aneesh Kumar K.V

    Aneesh Kumar K.V
     
  • This feature relaxes check restrictions on where each block group's
    metadata is located within the storage media. This allows for the allocation
    of bitmaps or inode tables outside the block group boundaries in cases
    where bad blocks force us to look for new blocks which the owning block
    group cannot satisfy. This will also allow for new metadata allocation
    schemes to improve performance and scalability.

    Signed-off-by: Jose R. Santos
    Cc:
    Signed-off-by: Andrew Morton

    Jose R. Santos
     
  • Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton

    Aneesh Kumar K.V
     
  • In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
    regardless of whether it is in use. This is the most time-consuming part
    of the filesystem check. The uninitialized block group feature can greatly
    reduce e2fsck time by eliminating checking of uninitialized inodes.

    With this feature, there is a high water mark of used inodes for each block
    group. Block and inode bitmaps can be uninitialized on disk via a flag in the
    group descriptor to avoid reading or scanning them at e2fsck time. A checksum
    of each group descriptor is used to ensure that corruption in the group
    descriptor's bit flags does not cause incorrect operation.

    The feature is enabled through a mkfs option

    mke2fs /dev/ -O uninit_groups

    A patch adding support for uninitialized block groups to e2fsprogs tools has
    been posted to the linux-ext4 mailing list.

    The patches have been stress tested with fsstress and fsx. In performance
    tests of e2fsck time, we have seen that e2fsck time on ext3 grows
    linearly with the total number of inodes in the filesystem. In ext4 with the
    uninitialized block groups feature, the e2fsck time does not grow with the
    total inode count; it depends solely on the number of used inodes.
    Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
    greatly reduce e2fsck time for users, with a performance improvement of 2-20
    times depending on how full the filesystem is.

    The attached graph shows the major improvements in e2fsck times in filesystems
    with a large total inode count, but few inodes in use.

    In each group descriptor if we have

    EXT4_BG_INODE_UNINIT set in bg_flags:
    Inode table is not initialized/used in this group. So we can skip
    the consistency check during fsck.
    EXT4_BG_BLOCK_UNINIT set in bg_flags:
    No block in the group is used. So we can skip the block bitmap
    verification for this group.

    We also add two new fields to group descriptor as a part of
    uninitialized group patch.

    __le16 bg_itable_unused; /* Unused inodes count */
    __le16 bg_checksum; /* crc16(sb_uuid+group+desc) */

    bg_itable_unused:

    If we have EXT4_BG_INODE_UNINIT not set in bg_flags
    then bg_itable_unused will give the offset within
    the inode table till the inodes are used. This can be
    used by fsck to skip list of inodes that are marked unused.

    bg_checksum:
    Now that we depend on bg_flags and bg_itable_unused to determine
    the block and inode usage, we need to make sure group descriptor
    is not corrupt. We add checksum to group descriptor to
    detect corruption. If the descriptor is found to be corrupt, we
    mark all the blocks and inodes in the group used.
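
    A rough user-space sketch of the checksum idea, under stated assumptions:
    the descriptor below is a hypothetical three-field stand-in rather than the
    on-disk layout, the flag values and the 0xFFFF seed are assumed, and
    model_crc16 is a plain bitwise CRC-16 (reflected polynomial 0xA001) written
    out here for illustration. It shows a checker trusting the UNINIT flag only
    when the checksum over uuid + group + descriptor matches:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values are assumptions for this model. */
#define MODEL_BG_INODE_UNINIT 0x0001
#define MODEL_BG_BLOCK_UNINIT 0x0002

/* Hypothetical, cut-down group descriptor. */
struct model_group_desc {
    uint16_t bg_flags;
    uint16_t bg_itable_unused; /* unused inodes count */
    uint16_t bg_checksum;      /* crc16(sb_uuid+group+desc) */
};

/* Bitwise CRC-16, reflected polynomial 0xA001. */
uint16_t model_crc16(uint16_t crc, const uint8_t *buf, size_t len)
{
    while (len--) {
        crc ^= *buf++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001)
                            : (uint16_t)(crc >> 1);
    }
    return crc;
}

/* Checksum over uuid, group number and the descriptor minus bg_checksum. */
uint16_t model_group_csum(const uint8_t uuid[16], uint32_t group,
                          const struct model_group_desc *gd)
{
    /* Serialize little-endian so the result is host independent. */
    uint8_t g[4] = {
        (uint8_t)(group & 0xff), (uint8_t)((group >> 8) & 0xff),
        (uint8_t)((group >> 16) & 0xff), (uint8_t)((group >> 24) & 0xff)
    };
    uint8_t d[4] = {
        (uint8_t)(gd->bg_flags & 0xff), (uint8_t)(gd->bg_flags >> 8),
        (uint8_t)(gd->bg_itable_unused & 0xff),
        (uint8_t)(gd->bg_itable_unused >> 8)
    };
    uint16_t crc = model_crc16(0xFFFF, uuid, 16); /* seed is an assumption */

    crc = model_crc16(crc, g, sizeof(g));
    return model_crc16(crc, d, sizeof(d));
}

/* A corrupt descriptor is treated as fully used, so nothing is skipped. */
bool model_can_skip_inode_table(const uint8_t uuid[16], uint32_t group,
                                const struct model_group_desc *gd)
{
    if (gd->bg_checksum != model_group_csum(uuid, group, gd))
        return false;
    return (gd->bg_flags & MODEL_BG_INODE_UNINIT) != 0;
}
```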

    Signed-off-by: Avantika Mathur
    Signed-off-by: Andreas Dilger
    Signed-off-by: Mingming Cao
    Signed-off-by: Aneesh Kumar K.V

    Andreas Dilger
     
  • CONFIG_EXT4_INDEX is not an exposed config option in the kernel, and it is
    unconditionally defined in ext4_fs.h. tune2fs is already able to turn off
    dir indexing, so at this point it's just cluttering up the code. Remove
    it.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton

    Eric Sandeen
     
  • Fragment support in ext2/3/4 was never implemented, and it probably will
    never be implemented. So remove it from ext4.

    Signed-off-by: Coly Li
    Acked-by: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: "Theodore Ts'o"

    Coly Li
     
  • Mostly stolen from akpm's JBD cleanup patch.

    - use `#ifdef foo' instead of `#if defined(foo)'

    - Make journal_enable_debug __read_mostly just for the heck of it

    - Make jbd_debugfs_dir and jbd_debug static

    - debugfs_remove(NULL) is legal: remove unneeded tests

    - remove unnecessary empty loops

    Signed-off-by: Jose R. Santos
    Cc:
    Signed-off-by: Andrew Morton

    Jose R. Santos
     
  • We should really call journal_abort() and not __journal_abort_hard() in
    case of errors. The latter call does not record the error in the journal
    superblock, so the filesystem won't be marked as having errors later (and
    the user could happily mount it without any warning).

    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton

    Jan Kara
     
  • change JBD_XXX macros to JBD2_XXX in JBD2/Ext4

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao
     
  • Convert kmalloc to kzalloc() and get rid of the memset().
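
    A user-space analogy of the conversion, with malloc/calloc standing in for
    kmalloc/kzalloc. The struct and function names are hypothetical and only
    illustrate the before and after shapes:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure, standing in for whatever is being allocated. */
struct model_journal {
    int j_flags;
    void *j_private;
};

/* Before: allocate, then clear by hand. */
struct model_journal *model_alloc_old(void)
{
    struct model_journal *j = malloc(sizeof(*j));

    if (!j)
        return NULL;
    memset(j, 0, sizeof(*j));
    return j;
}

/* After: a zeroing allocator does both steps in one call. */
struct model_journal *model_alloc_new(void)
{
    return calloc(1, sizeof(struct model_journal));
}
```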

    Signed-off-by: Mingming Cao

    Mingming Cao
     
  • This patch cleans up jbd_kmalloc and replaces it with kmalloc directly

    Signed-off-by: Mingming Cao

    Mingming Cao
     
  • This patch cleans up jbd_kmalloc and replaces it with kmalloc directly

    Signed-off-by: Mingming Cao

    Mingming Cao
     
  • JBD2: Replace slab allocations with page allocations

    JBD2 allocates memory for committed_data and frozen_data from slab. However,
    JBD2 should not pass slab pages down to the block layer. Use page allocator
    pages instead. This will also prepare JBD for the large blocksize patchset.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Mingming Cao

    Mingming Cao
     
  • JBD: Replace slab allocations with page allocations

    JBD allocates memory for committed_data and frozen_data from slab. However,
    JBD should not pass slab pages down to the block layer. Use page allocator
    pages instead. This will also prepare JBD for the large blocksize patchset.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Mingming Cao

    Mingming Cao
     
  • This patch moves transport dynamic registration and matching to the net
    module to prevent a bad Kconfig dependency between the net and fs 9p modules.

    Signed-off-by: Eric Van Hensbergen

    Eric Van Hensbergen
     
  • Loose mode in 9p utilizes the page cache without respecting coherency with
    the server. Any writes previously invalidated the entire mapping for a file.
    This patch softens the behavior to only invalidate the region of the actual
    write.
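
    As an illustration of what "the region of the actual write" means in page
    cache terms, the sketch below computes the first and last page index that a
    write touches. The names and the 4 KiB page size are assumptions for this
    example, not taken from the patch:

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Inclusive range of page indices covered by a write. */
struct model_page_range {
    uint64_t first;
    uint64_t last;
};

/* pos is the byte offset of the write, len its length (must be > 0). */
struct model_page_range model_write_range(uint64_t pos, uint64_t len)
{
    struct model_page_range r;

    r.first = pos >> MODEL_PAGE_SHIFT;
    r.last = (pos + len - 1) >> MODEL_PAGE_SHIFT;
    return r;
}
```

    Only pages first..last would need invalidating; the rest of the mapping can
    stay cached.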

    Signed-off-by: Eric Van Hensbergen

    Eric Van Hensbergen
     
  • The 9P2000 protocol requires the authentication and permission checks to be
    done in the file server. For that reason every user that accesses the file
    server tree has to authenticate and attach to the server separately.
    Multiple users can share the same connection to the server.

    Currently v9fs does a single attach and executes all I/O operations as a
    single user. This makes using v9fs in a multiuser environment unsafe, as it
    depends on the client doing the permission checking.

    This patch improves the 9P2000 support by allowing every user to attach
    separately. The patch defines three modes of access (new mount option
    'access'):

    - attach-per-user (access=user) (default mode for 9P2000.u)
    If a user tries to access a file served by v9fs for the first time, v9fs
    sends an attach command to the server (Tattach) specifying the user. If
    the attach succeeds, the user can access the v9fs tree.
    As there is no uname->uid (string->integer) mapping yet, this mode works
    only with the 9P2000.u dialect.

    - allow only one user to access the tree (access=)
    Only the user with uid can access the v9fs tree. Other users that attempt
    to access it will get EPERM error.

    - do all operations as a single user (access=any) (default for 9P2000)
    V9fs does a single attach and all operations are done as a single user.
    If this mode is selected, the v9fs behavior is identical with the current
    one.
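
    The three modes could be modelled roughly as follows. This is a sketch
    under assumptions, not the v9fs option parser: the enum, the function names
    and the choice to treat a purely numeric value as the single allowed uid
    are all hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical representation of the three access modes. */
enum model_access {
    MODEL_ACCESS_USER,   /* access=user: attach per user */
    MODEL_ACCESS_SINGLE, /* access=<number>: only that uid may use the tree */
    MODEL_ACCESS_ANY,    /* access=any: one attach, one user for everything */
    MODEL_ACCESS_INVALID
};

/* Parse the value of the 'access' option; *uid is set for SINGLE only. */
enum model_access model_parse_access(const char *val, unsigned long *uid)
{
    char *end;
    unsigned long parsed;

    if (strcmp(val, "user") == 0)
        return MODEL_ACCESS_USER;
    if (strcmp(val, "any") == 0)
        return MODEL_ACCESS_ANY;

    /* Require a leading digit so strtoul cannot accept signs or blanks. */
    if (val[0] < '0' || val[0] > '9')
        return MODEL_ACCESS_INVALID;
    parsed = strtoul(val, &end, 10);
    if (*end != '\0')
        return MODEL_ACCESS_INVALID;

    *uid = parsed;
    return MODEL_ACCESS_SINGLE;
}

/* In single-user mode every other uid is refused with EPERM. */
int model_may_access(enum model_access mode, unsigned long allowed_uid,
                     unsigned long uid)
{
    if (mode == MODEL_ACCESS_SINGLE && uid != allowed_uid)
        return -EPERM;
    return 0;
}
```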

    Signed-off-by: Latchesar Ionkov
    Signed-off-by: Eric Van Hensbergen

    Latchesar Ionkov
     
  • Change the names of 'uid' and 'gid' parameters to the more appropriate
    'dfltuid' and 'dfltgid'. This also sets the default uid/gid to -2
    (aka nfsnobody)

    Signed-off-by: Latchesar Ionkov
    Signed-off-by: Eric Van Hensbergen

    Latchesar Ionkov
     
  • Create more general flags field in the v9fs_session_info struct and move the
    'extended' flag as a bit in the flags.

    Signed-off-by: Latchesar Ionkov
    Signed-off-by: Eric Van Hensbergen

    Latchesar Ionkov
     
  • This patch abstracts out the interfaces to underlying transports so that
    new transports can be added as modules. This should also allow kernel
    configuration of transports without ifdef-hell.

    Signed-off-by: Eric Van Hensbergen

    Eric Van Hensbergen
     
  • * 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: (59 commits)
    [XFS] eagerly remove vmap mappings to avoid upsetting Xen
    [XFS] simplify validata_fields
    [XFS] no longer using io_vnode, as was remaining from 23 cherrypick
    [XFS] Remove STATIC which was missing from prior manual merge
    [XFS] Put back the QUEUE_ORDERED_NONE test in the barrier check.
    [XFS] Turn off XBF_ASYNC flag before re-reading superblock.
    [XFS] avoid race in sync_inodes() that can fail to write out all dirty data
    [XFS] This fix prevents bulkstat from spinning in an infinite loop.
    [XFS] simplify xfs_create/mknod/symlink prototype
    [XFS] avoid xfs_getattr in XFS_IOC_FSGETXATTR ioctl
    [XFS] get_bulkall() could return incorrect inode state
    [XFS] Kill unused IOMAP_EOF flag
    [XFS] fix when DMAPI mount option processing happens
    [XFS] ensure file size is logged on synchronous writes
    [XFS] growlock should be a mutex
    [XFS] replace some large xfs_log_priv.h macros by proper functions
    [XFS] kill struct bhv_vfs
    [XFS] move syncing related members from struct bhv_vfs to struct xfs_mount
    [XFS] kill the vfs_flags member in struct bhv_vfs
    [XFS] kill the vfs_fsid and vfs_altfsid members in struct bhv_vfs
    ...

    Linus Torvalds
     

17 Oct, 2007

4 commits

  • This patch contains the following cleanups that are now possible:
    - remove the unused security_operations->inode_xattr_getsuffix
    - remove the no longer used security_operations->unregister_security
    - remove some no longer required exit code
    - remove a bunch of no longer used exports

    Signed-off-by: Adrian Bunk
    Acked-by: James Morris
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Implement file posix capabilities. This allows programs to be given a
    subset of root's powers regardless of who runs them, without having to use
    setuid and giving the binary all of root's powers.

    This version works with Kaigai Kohei's userspace tools, found at
    http://www.kaigai.gr.jp/index.php. For more information on how to use this
    patch, Chris Friedhoff has posted a nice page at
    http://www.friedhoff.org/fscaps.html.

    Changelog:
    Nov 27:
    Incorporate fixes from Andrew Morton
    (security-introduce-file-caps-tweaks and
    security-introduce-file-caps-warning-fix)
    Fix Kconfig dependency.
    Fix change signaling behavior when file caps are not compiled in.

    Nov 13:
    Integrate comments from Alexey: Remove CONFIG_ ifdef from
    capability.h, and use %zd for printing a size_t.

    Nov 13:
    Fix endianness warnings by sparse as suggested by Alexey
    Dobriyan.

    Nov 09:
    Address warnings of unused variables at cap_bprm_set_security
    when file capabilities are disabled, and simultaneously clean
    up the code a little, by pulling the new code into a helper
    function.

    Nov 08:
    For pointers to required userspace tools and how to use
    them, see http://www.friedhoff.org/fscaps.html.

    Nov 07:
    Fix the calculation of the highest bit checked in
    check_cap_sanity().

    Nov 07:
    Allow file caps to be enabled without CONFIG_SECURITY, since
    capabilities are the default.
    Hook cap_task_setscheduler when !CONFIG_SECURITY.
    Move capable(TASK_KILL) to end of cap_task_kill to reduce
    audit messages.

    Nov 05:
    Add secondary calls in selinux/hooks.c to task_setioprio and
    task_setscheduler so that selinux and capabilities with file
    cap support can be stacked.

    Sep 05:
    As Seth Arnold points out, uid checks are out of place
    for capability code.

    Sep 01:
    Define task_setscheduler, task_setioprio, cap_task_kill, and
    task_setnice to make sure a user cannot affect a process in which
    they called a program with some fscaps.

    One remaining question is the note under task_setscheduler: are we
    ok with CAP_SYS_NICE being sufficient to confine a process to a
    cpuset?

    It is a semantic change, as without fscaps, attach_task doesn't
    allow CAP_SYS_NICE to override the uid equivalence check. But since
    it uses security_task_setscheduler, which elsewhere is used where
    CAP_SYS_NICE can be used to override the uid equivalence check,
    fixing it might be tough.

    task_setscheduler
    note: this also controls cpuset:attach_task. Are we ok with
    CAP_SYS_NICE being used to confine to a cpuset?
    task_setioprio
    task_setnice
    sys_setpriority uses this (through set_one_prio) for another
    process. Need same checks as setrlimit

    Aug 21:
    Updated secureexec implementation to reflect the fact that
    euid and uid might be the same and nonzero, but the process
    might still have elevated caps.

    Aug 15:
    Handle endianness of xattrs.
    Enforce capability version match between kernel and disk.
    Enforce that no bits beyond the known max capability are
    set, else return -EPERM.
    With this extra processing, it may be worth reconsidering
    doing all the work at bprm_set_security rather than
    d_instantiate.

    Aug 10:
    Always call getxattr at bprm_set_security, rather than
    caching it at d_instantiate.

    [morgan@kernel.org: file-caps clean up for linux/capability.h]
    [bunk@kernel.org: unexport cap_inode_killpriv]
    Signed-off-by: Serge E. Hallyn
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morgan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • I'm going to be modifying nfsd_rename() shortly to support read-only bind
    mounts. This #ifdef is around the area I'm patching, and it starts to get
    really ugly if I just try to add my new code by itself. Using this little
    helper makes things a lot cleaner to use.

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • First of all, this makes the structure jumping look a little bit cleaner. So,
    this stands alone as a tiny cleanup. But, we also need 'mnt' by itself a few
    more times later in this series, so this isn't _just_ a cleanup.

    Signed-off-by: Dave Hansen
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen