04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems, trivially
    making things safer at no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives, allowing simple, safe,
    well-understood work-arounds to known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as its module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases, the most significant being autofs, which lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the
    request_module() call in get_fs_type() without having any special
    permissions, and people get uncomfortable when a user-specified string
    (in this case the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regard to the user's permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep,
    which means there is not much attack surface exposed by loading a
    filesystem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
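
The naming scheme described in this commit can be illustrated with a small user-space sketch. It only shows how a filesystem type maps to a module alias of the form "fs-<type>"; the helper name fs_module_alias() is hypothetical and is not the kernel code touched by the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the module alias that would be requested for a filesystem type,
 * e.g. "autofs" -> "fs-autofs". Returns 0 on success, -1 if the result
 * would not fit in buf. Hypothetical helper, for illustration only.
 */
static int fs_module_alias(char *buf, size_t len, const char *fstype)
{
    assert(buf != NULL && fstype != NULL);

    /* sizeof("fs-") counts the prefix plus the terminating NUL. */
    size_t need = strlen(fstype) + sizeof("fs-");
    if (need > len)
        return -1;

    snprintf(buf, len, "fs-%s", fstype);
    return 0;
}
```

Under this scheme a mount of type "autofs" would ask for "fs-autofs", so a module named autofs4 is found only if it carries a matching alias, which is the mismatch the commit message describes.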
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

26 Feb, 2013

3 commits

  • According to SUSv3:

    [EACCES] Permission denied. An attempt was made to access a file in a way
    forbidden by its file access permissions.

    [EPERM] Operation not permitted. An attempt was made to perform an operation
    limited to processes with appropriate privileges or to the owner of a file
    or other resource.

    So -EPERM should be returned if a capability check fails.

    Strictly speaking this is an API change since the error code the user
    sees is altered.

    Signed-off-by: Zhao Hongjiang
    Acked-by: Jan Kara
    Acked-by: Steven Whitehouse
    Acked-by: Ian Kent
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Zhao Hongjiang
     
  • This patch is a follow-up to the patch below:

    [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
    commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63

    Signed-off-by: Namjae Jeon
    Signed-off-by: Vivek Trivedi
    Acked-by: Steven Whitehouse
    Acked-by: Sage Weil
    Signed-off-by: Al Viro

    Namjae Jeon
     
  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhancements to the user
    namespace: reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users, then when you have user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
     

23 Feb, 2013

1 commit


22 Feb, 2013

1 commit

  • Create a helper function to check if a backing device requires stable
    page writes and, if so, perform the necessary wait. Then, make it so
    that all points in the memory manager that handle making pages writable
    use the helper function. This should provide stable page write support
    to most filesystems, while eliminating unnecessary waiting for devices
    that don't require the feature.

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own stable page guarantees or they don't block at all.
    The blocking behavior is back to what it was before 3.0 if you don't
    have a disk requiring stable page writes.

    Here's the result of using dbench to test latency on ext2:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        109347     0.028    59.817
    ReadX         347180     0.004     3.391
    Flush          15514    29.828   287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    WriteX        105556     0.029     4.273
    ReadX         335004     0.005     4.112
    Flush          14982    30.540   298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, the maximum write latency drops considerably with this
    patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
    similarly, but see the cover letter for those results.

    Signed-off-by: Darrick J. Wong
    Acked-by: Steven Whitehouse
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

13 Feb, 2013

13 commits


02 Feb, 2013

2 commits

  • This patch allocates a block reservation structure before growing
    or shrinking a file. Without this structure, the grow or shrink code
    can reference a bad pointer.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • The intent here is to split the processing of the glock lru
    list into two parts, so that the selection of glocks and the
    disposal are separate functions. The plan is that further
    updates can then be made to these functions in the future
    to improve the selection of glocks and also the efficiency of
    glock disposal.

    The new feature which this patch brings is sorting the
    glocks to be disposed of into glock number (and thus also
    disk block number) order. Not all glocks will need i/o in
    order to dispose of them, but some will, and at least we'll
    generate mostly disk block order i/o now.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
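
As an illustration of the ordering idea in this commit, the sketch below sorts a batch of to-be-disposed glocks by number using qsort(). It assumes a simplified array-based model (struct glock_model is hypothetical); the kernel keeps glocks on linked lists and would use its own list-sorting facilities instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified glock: only the number matters here. */
struct glock_model { uint64_t number; };

/* Order by glock number, which also tracks disk block number. */
static int glock_cmp(const void *a, const void *b)
{
    const struct glock_model *ga = a, *gb = b;

    /* Compare explicitly; a 64-bit difference does not fit in an int. */
    return (ga->number > gb->number) - (ga->number < gb->number);
}

/* Sort the glocks selected for disposal into ascending number order. */
static void sort_for_disposal(struct glock_model *gl, size_t n)
{
    assert(gl != NULL || n == 0);

    if (n > 1)
        qsort(gl, n, sizeof(*gl), glock_cmp);
}
```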
     

29 Jan, 2013

7 commits

  • Instead of using a list of buffers to write ahead of the journal
    flush, this now uses a list of inodes and calls ->writepages
    via filemap_fdatawrite() in order to achieve the same thing. For
    most use cases this results in a shorter ordered write list,
    as well as much larger i/os being issued.

    The ordered write list is sorted by inode number before writing
    in order to retain the disk block ordering between inodes as
    per the previous code.

    The previous ordered write code conflicted with mpage_writepages()
    in its assumptions about how to write out the disk blocks, so with
    this updated version we can also use mpage_writepages() for GFS2's
    ordered-write writepages implementation. So we will also send
    larger i/os from writeback too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The freeze code has not been looked at a lot recently. Upstream has
    moved on, and this is an attempt to catch us back up again. There
    is a vfs level interface for the freeze code which can be called
    from our (obsolete, but kept for backward compatibility purposes)
    sysfs freeze interface. This means freezing this way vs. doing it
    from the ioctl should now work in identical fashion.

    As a result of this, the freeze function is only called once
    and we can drop our own special purpose code for counting the
    number of freezes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The locking in gfs2_attach_bufdata() was type specific (data/meta)
    which made the function rather confusing. This patch moves the core
    of gfs2_attach_bufdata() into trans.c renaming it gfs2_alloc_bufdata()
    and moving the locking into gfs2_trans_add_data()/gfs2_trans_add_meta()

    As a result all of the locking related to adding data and metadata to
    the journal is now in these two functions. This should help to clarify
    what is going on, and give us some opportunities to simplify in
    some cases.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch copies the body of gfs2_trans_add_bh into the two newly
    added gfs2_trans_add_data and gfs2_trans_add_meta functions. We can
    then move the .lo_add functions from lops.c into trans.c and call
    them directly.

    As a result of this, we no longer need to use the .lo_add functions
    at all, so that is removed from the log operations structure.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • There is little common content in gfs2_trans_add_bh() between the data
    and meta classes by the time that the functions which it calls are
    taken into account. The intent here is to split this into two
    separate functions. Stage one is to introduce gfs2_trans_add_data()
    and gfs2_trans_add_meta() and update the callers accordingly.

    Later patches will then pull in the content of gfs2_trans_add_bh()
    and its dependent functions in order to clean up the code in this
    area.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This moves the lo_add function for revokes into trans.c, removing
    a function call and making the code easier to read.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This breaks out the LRU scanning function from the shrinker in
    preparation for adding other callers to the LRU scanner.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

28 Jan, 2013

1 commit


02 Jan, 2013

4 commits

  • In function rg_mblk_search, it's searching for multiple blocks in
    a given state (e.g. "free"). If there's an active block reservation
    its goal is the next free block of that. If the resource group
    contains the dinode's goal block, that's used for the search. But
    if neither is the case, it uses the rgrp's last allocated block.
    That way, consecutive allocations appear after one another on media.
    The problem comes in when you hit the end of the rgrp; it would never
    start over and search from the beginning. This became a problem,
    since if you deleted all the files and data from the rgrp, it would
    never start over and find free blocks. So it had to keep searching
    further out on the media to allocate blocks. This patch resets the
    rd_last_alloc after it does an unsuccessful search at the end of
    the rgrp.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
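
The wrap-around behaviour described in this commit can be sketched with a toy allocator. All names here (struct rgrp_model, scan_from, alloc_block) are hypothetical, and the sketch retries from the start immediately after the failed tail search, which is a simplification of simply resetting the starting point for the next search.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource group: a flat array of in-use flags. */
struct rgrp_model {
    bool  *used;        /* used[i] is true if block i is allocated */
    size_t nblocks;
    size_t last_alloc;  /* where the next search starts */
};

/* Scan [start, nblocks) for a free block; return nblocks if none. */
static size_t scan_from(const struct rgrp_model *rg, size_t start)
{
    for (size_t i = start; i < rg->nblocks; i++)
        if (!rg->used[i])
            return i;
    return rg->nblocks;
}

/* Allocate one block; returns its index, or nblocks if the rgrp is full. */
static size_t alloc_block(struct rgrp_model *rg)
{
    assert(rg != NULL && rg->last_alloc <= rg->nblocks);

    size_t blk = scan_from(rg, rg->last_alloc);
    if (blk == rg->nblocks && rg->last_alloc != 0) {
        rg->last_alloc = 0;      /* reset after the failed tail search */
        blk = scan_from(rg, 0);  /* start over from the beginning */
    }
    if (blk < rg->nblocks) {
        rg->used[blk] = true;
        rg->last_alloc = blk;    /* keep consecutive allocations adjacent */
    }
    return blk;
}
```

Without the reset, a group whose early blocks had been freed would still report no space once the search position reached the end.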
     
  • This patch adds a return code check after calling function
    gfs2_rbm_from_block while determining the free extent size.
    That way, when the end of an rgrp is reached, it won't try
    to process unaligned blocks after the end.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • QE aio tests uncovered a race condition in gfs2_rs_alloc where it's possible
    to come out of the function with a valid ip->i_res allocation but it gets
    freed before use resulting in a NULL ptr dereference.

    This patch moves the initial short-circuit check for non-NULL ip->i_res
    inside the mutex lock. With this patch, I was able to successfully run the
    reproducer test multiple times.

    Resolves: rhbz#878476
    Signed-off-by: Abhi Das
    Signed-off-by: Steven Whitehouse

    Abhijith Das
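
The shape of this fix, checking the pointer only while holding the lock, can be sketched in user space with a pthread mutex. The types and function names below are hypothetical and the sketch is not the gfs2_rs_alloc() code itself; it only shows the pattern of keeping the NULL test and the allocation in one critical section, with the free path taking the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical inode with a lazily allocated reservation. */
struct resv { int dummy; };
struct inode_model {
    pthread_mutex_t lock;
    struct resv *res;
};

/* Ensure ip->res exists. The NULL check happens while holding the lock. */
static int rs_alloc(struct inode_model *ip)
{
    int err = 0;

    assert(ip != NULL);
    pthread_mutex_lock(&ip->lock);
    if (ip->res == NULL) {
        ip->res = calloc(1, sizeof(*ip->res));
        if (ip->res == NULL)
            err = -1;            /* allocation failed */
    }
    pthread_mutex_unlock(&ip->lock);
    return err;
}

/* Free the reservation under the same lock. */
static void rs_free(struct inode_model *ip)
{
    assert(ip != NULL);
    pthread_mutex_lock(&ip->lock);
    free(ip->res);
    ip->res = NULL;
    pthread_mutex_unlock(&ip->lock);
}
```

If the NULL test sat outside the lock, another thread could free the reservation between the test and the caller's first use of it.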
     
  • When generating the DLM lock name, a value of 0 would skip
    the loop and leave the string unchanged. This left locks with
    a value of 0 unlabeled. Initializing the string to '0' fixes this.

    Signed-off-by: Nathan Straz
    Signed-off-by: Steven Whitehouse

    Nathan Straz
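
A sketch of the zero-value problem described in this commit, assuming the name field is pre-filled with spaces and digits are written right to left. The function names and FIELD_WIDTH are hypothetical choices for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FIELD_WIDTH 16  /* enough for any 64-bit value in hex */

/*
 * Write value as hex digits right to left, ending at position `last`.
 * The caller must provide enough room to the left of `last`.
 */
static void reverse_hex(char *last, uint64_t value)
{
    static const char hex[] = "0123456789abcdef";

    assert(last != NULL);
    *last = '0';                 /* the fix: a value of 0 still yields "0" */
    while (value) {
        *last = hex[value & 0x0f];
        value >>= 4;
        if (value)
            last--;              /* step left only if more digits follow */
    }
}

/* Fill a FIELD_WIDTH-character, space-padded field with value in hex. */
static void format_number(char field[FIELD_WIDTH + 1], uint64_t value)
{
    assert(field != NULL);
    memset(field, ' ', FIELD_WIDTH);
    field[FIELD_WIDTH] = '\0';
    reverse_hex(&field[FIELD_WIDTH - 1], value);
}
```

Without the initial assignment the loop body never runs for 0 and the field stays all spaces, which matches the unlabeled locks the commit describes.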
     

18 Dec, 2012

1 commit


16 Dec, 2012

1 commit

  • Pull GFS2 updates from Steven Whitehouse:
    "The main feature this time is the new Orlov allocator and the patches
    leading up to it which allow us to allocate new inodes from their own
    allocation context, rather than borrowing that of their parent
    directory. It is this change which then allows us to choose a
    different location for subdirectories when required. This works
    exactly as per the ext3 implementation from the user's point of view.

    In addition to that, we've got a speed up in gfs2_rbm_from_block()
    from Bob Peterson, three locking related improvements from Dave
    Teigland plus a selection of smaller bug fixes and clean ups."

    * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Set gl_object during inode create
    GFS2: add error check while allocating new inodes
    GFS2: don't reference inode's glock during block allocation trace
    GFS2: remove redundant lvb pointer
    GFS2: only use lvb on glocks that need it
    GFS2: skip dlm_unlock calls in unmount
    GFS2: Fix one RG corner case
    GFS2: Eliminate redundant buffer_head manipulation in gfs2_unlink_inode
    GFS2: Use dirty_inode in gfs2_dir_add
    GFS2: Fix truncation of journaled data files
    GFS2: Add Orlov allocator
    GFS2: Use proper allocation context for new inodes
    GFS2: Add test for resource group congestion status
    GFS2: Rename glops go_xmote_th to go_sync
    GFS2: Speed up gfs2_rbm_from_block
    GFS2: Review bug traps in glops.c

    Linus Torvalds
     

12 Dec, 2012

1 commit

  • Overhaul struct address_space.assoc_mapping, renaming it to
    address_space.private_data and redefining its type as void*. With this
    approach we consistently name the .private_* elements of struct
    address_space and also allow extended usage for address_space
    association with other data structures through ->private_data.

    Also, all users of old ->assoc_mapping element are converted to reflect
    its new name and type change (->private_data).

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

21 Nov, 2012

1 commit

  • This patch fixes a cluster coherency problem that occurs when one
    node creates a file, does several writes, then a different node
    tries to write to the same file. When the inode's glock is demoted,
    the inode wasn't synced to the media properly because the gl_object
    wasn't set. Later, the flush daemon noticed the uncommitted data
    and tried to flush it, only to discover the glock was no longer locked
    properly in exclusive mode. That caused an assert withdraw.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

16 Nov, 2012

2 commits