01 May, 2013

2 commits

  • Pull GFS2 updates from Steven Whitehouse:
    "There is not a whole lot of change this time - there are some further
    changes which are in the works, but those will be held over until next
    time.

    Here there are some clean ups to inode creation, the addition of an
    origin (local or remote) indicator to glock demote requests, removal
    of one of the remaining GFP_NOFAIL allocations during log flushes, one
    minor clean up, and a one liner bug fix."

    * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Flush work queue before clearing glock hash tables
    GFS2: Add origin indicator to glock demote tracing
    GFS2: Add origin indicator to glock callbacks
    GFS2: replace gfs2_ail structure with gfs2_trans
    GFS2: Remove vestigial parameter ip from function rs_deltree
    GFS2: Use gfs2_dinode_out() in the inode create path
    GFS2: Remove gfs2_refresh_inode from inode creation path
    GFS2: Clean up inode creation path

    Linus Torvalds
     
  • Pull trivial tree updates from Jiri Kosina:
    "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small
    code cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits)
    mm: Convert print_symbol to %pSR
    gfs2: Convert print_symbol to %pSR
    m32r: Convert print_symbol to %pSR
    iostats.txt: add easy-to-find description for field 6
    x86 cmpxchg.h: fix wrong comment
    treewide: Fix typo in printk and comments
    doc: devicetree: Fix various typos
    docbook: fix 8250 naming in device-drivers
    pata_pdc2027x: Fix compiler warning
    treewide: Fix typo in printks
    mei: Fix comments in drivers/misc/mei
    treewide: Fix typos in kernel messages
    pm44xx: Fix comment for "CONFIG_CPU_IDLE"
    doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP"
    mmzone: correct "pags" to "pages" in comment.
    kernel-parameters: remove outdated 'noresidual' parameter
    Remove spurious _H suffixes from ifdef comments
    sound: Remove stray pluses from Kconfig file
    radio-shark: Fix printk "CONFIG_LED_CLASS"
    doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE
    ...

    Linus Torvalds
     

29 Apr, 2013

1 commit


26 Apr, 2013

1 commit

  • There was a timing window when a GFS2 file system was unmounted
    that caused GFS2 to call BUG() and panic the kernel. The call
    to BUG() is meant to ensure that the glock reference count,
    gl_ref, never gets down to zero and bounces back up again. What was
    happening during umount is that function gfs2_put_super was dequeuing
    its glocks for well-known files. In particular, we saw it on the
    journal glock, sd_jinode_gh. The dequeue caused delayed work to be
    queued for the glock state machine, to transition the lock to an
    "unlocked" state. While the work was still queued, gfs2_put_super
    called gfs2_gl_hash_clear to clear out the glock hash tables.
    If the timing was just so, the glock work function would drop the
    reference count at the time when it was being checked for zero,
    and that caused BUG() to be called. This patch calls
    flush_workqueue before clearing the glock hash tables, thereby
    ensuring that the delayed work is executed before the hash tables
    are cleared, and therefore the reference count never goes to zero
    until the glock is cleared.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
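
    The ordering fix described above can be sketched with a small,
    single-threaded model. This is only an illustration under stated
    assumptions: the Glock and Workqueue classes and the unmount() helper
    are hypothetical stand-ins, not GFS2 code, and the model shows only
    whether work is still pending when the table is cleared, not the
    actual timing window.

```python
from collections import deque


class Glock:
    """Toy glock: 'ref' models gl_ref; everything else is omitted."""

    def __init__(self, name):
        self.name = name
        self.ref = 1  # reference held on behalf of the queued work item


class Workqueue:
    """Toy stand-in for a kernel workqueue (hypothetical, single-threaded)."""

    def __init__(self):
        self.pending = deque()

    def queue(self, fn, *args):
        self.pending.append((fn, args))

    def flush(self):
        # Run every queued item to completion before returning.
        while self.pending:
            fn, args = self.pending.popleft()
            fn(*args)


def glock_work(gl):
    # The state machine finishes its transition and drops its reference.
    gl.ref -= 1


def unmount(table, wq, flush_first):
    """Return how many work items are still pending when the table is cleared."""
    if flush_first:
        wq.flush()
    pending_at_clear = len(wq.pending)
    table.clear()
    return pending_at_clear
```

    With flush_first=True no work is left to touch the reference count
    once the table has been cleared; with flush_first=False the queued
    item is still outstanding at that point.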
     

10 Apr, 2013

2 commits

  • This adds the origin indicator to the trace point for glock
    demotion, so that it is possible to see where demote requests
    have come from.

    Note that requests generated from the demote_rq sysfs interface
    will show as remote, since they are intended to replicate
    exactly the effect of a demote request from a remote node. It
    is still possible to tell these apart by looking at the process
    which initiated the demote request.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch adds a bool indicating whether the demote
    request was originated locally or remotely. This is then
    used by the iopen ->go_callback() to make 100% sure that
    it will only respond to remote callbacks.

    Since ->evict_inode() uses GL_NOCACHE when it attempts to
    get an exclusive lock on the iopen lock, this may result
    in extra scheduling of the workqueue in the case that the
    exclusive promotion request failed. This patch prevents
    that from happening.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
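
    A minimal sketch of the idea, assuming a callback that receives the
    origin as a boolean. The class, the work_scheduled counter and the
    function body are hypothetical; only the "respond to remote requests
    only" rule comes from the description above.

```python
class Glock:
    """Toy glock that only counts how often work was scheduled."""

    def __init__(self):
        self.work_scheduled = 0


def iopen_go_callback(gl, remote):
    """Hypothetical stand-in for the iopen ->go_callback()."""
    if not remote:
        # Locally originated demote request: nothing to do, so no
        # extra workqueue scheduling happens.
        return
    # A remote node wants the lock: schedule the work to handle it.
    gl.work_scheduled += 1
```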
     

08 Apr, 2013

5 commits

  • In order to allow transactions and log flushes to happen at the same
    time, gfs2 needs to move the transaction accounting and active items
    list code into the gfs2_trans structure. As a first step toward this,
    this patch removes the gfs2_ail structure, and handles the active items
    list in the gfs2_trans structure. This keeps gfs2 from allocating an ail
    structure on log flushes, and gives us a structure that can later be used
    to store the transaction accounting outside of the gfs2 superblock
    structure.

    With this patch, at the end of a transaction, gfs2 will add the
    gfs2_trans structure to the superblock if there is not one already.
    This structure now has the active items fields that were previously in
    gfs2_ail. This is not necessary in the case where the transaction was
    simply used to add revokes, since these are never written outside of the
    journal, and thus, don't need an active items list.

    Also, in order to make sure that the transaction structure is not
    removed while it's still in use by gfs2_trans_end, unlocking the
    sd_log_flush_lock has to happen slightly later in ending the
    transaction.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • The functions that delete block reservations from the rgrp block
    reservations rbtree no longer use the ip parameter. This patch
    eliminates the parameter.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • Over the previous two patches relating to inode creation, the
    content of init_dinode() has been looking more and more like
    gfs2_dinode_out(). This is not an accident! This patch replaces
    the parts of init_dinode() which are duplicated in gfs2_dinode_out()
    with a call to that function.

    Mostly that is straightforward, but there is one issue which needed
    to be resolved relating to the link count. The link count has to be
    set to zero in a certain error handling code path, which lands up
    calling iput(). This is now done specifically in that code path
    allowing the link count to be set earlier and written into the
    on disk inode by gfs2_dinode_out() in the normal way.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The original method for creating inodes used in GFS2 was to fill
    out a buffer, with all the information, and then to read that
    buffer into the in-core inode, using gfs2_refresh_inode().

    The problem with this approach is that all the inode's fields
    need to be calculated ahead of time, and were stored in various
    variables making the code rather complicated.

    The new approach is simply to allocate the in-core inode earlier
    and fill in as many fields as possible ahead of time. These can
    then be used to initialise the on disk representation. The
    code has been working towards the point where it is possible
    to remove gfs2_refresh_inode() because all the fields are
    correctly initialised ahead of time. We've now reached that
    milestone, and have reversed the order of setting up the in
    core and on disk inodes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
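
    The reversed order (in-core inode first, on disk representation
    derived from it) can be sketched as follows. The field set and the
    DINODE_FMT layout are assumptions made up for this illustration and
    do not match the real GFS2 dinode; dinode_out() merely plays the
    role that gfs2_dinode_out() has in the text.

```python
import struct
from dataclasses import dataclass


@dataclass
class InCoreInode:
    mode: int
    uid: int
    gid: int
    nlink: int
    size: int


# Hypothetical on-disk layout: four 32-bit fields and one 64-bit field,
# big-endian, no padding.
DINODE_FMT = ">IIIIQ"


def dinode_out(inode):
    """Serialise the in-core inode into its on-disk form."""
    return struct.pack(
        DINODE_FMT, inode.mode, inode.uid, inode.gid, inode.nlink, inode.size
    )


def create_inode(mode, uid, gid):
    # Step 1: set up the in-core inode with every field known up front.
    inode = InCoreInode(mode=mode, uid=uid, gid=gid, nlink=1, size=0)
    # Step 2: derive the on-disk buffer from it; nothing is read back.
    return inode, dinode_out(inode)
```

    Because the buffer is produced from the in-core inode, there is no
    second copy of the fields held in local variables and no refresh
    step afterwards.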
     
  • This patch cleans up the inode creation code path in GFS2. After the
    Orlov allocator was merged, a number of potential improvements are
    now possible, and this is a first set of these.

    The quota handling is now updated so that it matches the point in
    the code where the allocation takes place. This means that the one
    exception in gfs2_alloc_blocks relating to quota is now no longer
    required, and we can use the generic code everywhere.

    In addition the call to figure out whether we need to allocate any
    extra blocks in order to add a directory entry is moved higher up
    gfs2_create_inode. This means that if it returns an error, we
    can deal with that at a stage where it is easier to handle that case.
    The returned status cannot change during the function since we hold
    an exclusive lock on the directory.

    Two calls to gfs2_rindex_update have been changed to one, again at
    the top of gfs2_create_inode to simplify error handling.

    The time stamps are also now initialised earlier in the creation
    process. This is gradually moving towards being able to remove the
    call to gfs2_refresh_inode in gfs2_create_inode once we have all the
    fields covered.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

06 Apr, 2013

1 commit

  • This patch changes GFS2's discard issuing code so that it calls
    function sb_issue_discard rather than blkdev_issue_discard. The
    code was calling blkdev_issue_discard and specifying the correct
    sector offset and sector size, but blkdev_issue_discard expects
    these values to be in terms of 512 byte sectors, even if the native
    sector size for the device is different. Calling sb_issue_discard
    with the BLOCK size instead ensures the correct block-to-512b-sector
    translation. I verified that "minlen" is specified in blocks, so
    comparing it to a number of blocks is correct.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
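
    The unit mismatch can be illustrated with the arithmetic involved.
    This is a sketch, assuming the file system block size is a power of
    two given as a bit count; blocks_to_sectors() is a hypothetical
    helper name, not a kernel function.

```python
SECTOR_SHIFT = 9  # 512-byte sectors


def blocks_to_sectors(block, nr_blocks, blocksize_bits):
    """Translate a (start block, length in blocks) pair to 512-byte sectors."""
    shift = blocksize_bits - SECTOR_SHIFT
    return block << shift, nr_blocks << shift
```

    For 4096-byte blocks (blocksize_bits = 12), block 10 with a length
    of 4 blocks becomes sector 80 with a length of 32 sectors. Passing
    the untranslated values to an interface that counts in 512-byte
    sectors would discard the wrong, and much smaller, range.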
     

04 Apr, 2013

4 commits


04 Mar, 2013

1 commit

  • Modify the request_module to prefix the file system type with "fs-"
    and add aliases to all of the filesystems that can be built as modules
    to match.

    A common practice is to build all of the kernel code and leave code
    that is not commonly needed as modules, with the result that many
    users are exposed to any bug anywhere in the kernel.

    Looking for filesystems with a fs- prefix limits the pool of possible
    modules that can be loaded by mount to just filesystems, trivially
    making things safer with no real cost.

    Using aliases means user space can control the policy of which
    filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
    with blacklist and alias directives, allowing simple, safe,
    well understood work-arounds to known problematic software.

    This also addresses a rare but unfortunate problem where the filesystem
    name is not the same as its module name and module auto-loading
    would not work. While writing this patch I saw a handful of such
    cases. The most significant is autofs, which lives in the module
    autofs4.

    This is relevant to user namespaces because we can reach the request
    module in get_fs_type() without having any special permissions, and
    people get uncomfortable when a user specified string (in this case
    the filesystem type) goes all of the way to request_module.

    After having looked at this issue I don't think there is any
    particular reason to perform any filtering or permission checks beyond
    making it clear in the module request that we want a filesystem
    module. The common pattern in the kernel is to call request_module()
    without regard to the user's permissions. In general all a filesystem
    module does once loaded is call register_filesystem() and go to sleep,
    which means there is not much attack surface exposed by loading a
    filesystem module unless the filesystem is mounted. In a user
    namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
    which most filesystems do not set today.

    Acked-by: Serge Hallyn
    Acked-by: Kees Cook
    Reported-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
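
    The lookup scheme can be sketched as a simplified model. The alias
    table below is invented for illustration (apart from the autofs to
    autofs4 case mentioned above), and the blacklist handling is a
    deliberately naive stand-in for what modprobe does with its
    configuration files.

```python
# Hypothetical alias table: request name -> module that provides it.
ALIASES = {
    "fs-autofs": "autofs4",
    "fs-gfs2": "gfs2",
}


def module_request_name(fstype):
    """Name requested on behalf of mount for a given filesystem type."""
    return "fs-" + fstype


def resolve_module(fstype, aliases=ALIASES, blacklist=frozenset()):
    """Return the module to load, or None if nothing may be loaded."""
    module = aliases.get(module_request_name(fstype))
    if module is None or module in blacklist:
        return None
    return module
```

    A user-supplied type string such as "snd-pcm" resolves to nothing,
    because no module advertises an "fs-snd-pcm" alias, while "autofs"
    finds its differently named module.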
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

26 Feb, 2013

3 commits

  • According to SUSv3:

    [EACCES] Permission denied. An attempt was made to access a file in a way
    forbidden by its file access permissions.

    [EPERM] Operation not permitted. An attempt was made to perform an operation
    limited to processes with appropriate privileges or to the owner of a file
    or other resource.

    So -EPERM should be returned if a capability check fails.

    Strictly speaking this is an API change since the error code user sees is
    altered.

    Signed-off-by: Zhao Hongjiang
    Acked-by: Jan Kara
    Acked-by: Steven Whitehouse
    Acked-by: Ian Kent
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Zhao Hongjiang
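
    The distinction can be sketched with two toy checks. Both functions
    are hypothetical and only mirror the SUSv3 wording quoted above; they
    follow the kernel convention of returning a negative errno value.

```python
import errno


def privileged_operation(has_capability):
    """An operation limited to processes with appropriate privileges."""
    if not has_capability:
        return -errno.EPERM  # operation not permitted
    return 0


def access_file(mode_allows):
    """An access governed by the file's permission bits."""
    if not mode_allows:
        return -errno.EACCES  # permission denied
    return 0
```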
     
  • This patch is a follow-up to the patch below:

    [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
    commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63

    Signed-off-by: Namjae Jeon
    Signed-off-by: Vivek Trivedi
    Acked-by: Steven Whitehouse
    Acked-by: Sage Weil
    Signed-off-by: Al Viro

    Namjae Jeon
     
  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhancements to the user
    namespace: reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users that when you have the user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
     

23 Feb, 2013

1 commit


22 Feb, 2013

1 commit

  • Create a helper function to check if a backing device requires stable
    page writes and, if so, perform the necessary wait. Then, make it so
    that all points in the memory manager that handle making pages writable
    use the helper function. This should provide stable page write support
    to most filesystems, while eliminating unnecessary waiting for devices
    that don't require the feature.

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own stable page guarantees or they don't block at all.
    The blocking behavior is back to what it was before 3.0 if you don't
    have a disk requiring stable page writes.

    Here's the result of using dbench to test latency on ext2:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        109347     0.028    59.817
    ReadX         347180     0.004     3.391
    Flush          15514    29.828   287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        105556     0.029     4.273
    ReadX         335004     0.005     4.112
    Flush          14982    30.540   298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, the maximum write latency drops considerably with this
    patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
    similarly, but see the cover letter for those results.

    Signed-off-by: Darrick J. Wong
    Acked-by: Steven Whitehouse
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
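
    The helper's behaviour can be sketched as below. The classes and
    the function and attribute names are assumptions chosen for this
    illustration; the only thing taken from the description is the
    rule "wait only when the backing device requires stable pages".

```python
class BackingDev:
    """Toy backing device with a single capability flag."""

    def __init__(self, requires_stable_pages):
        self.requires_stable_pages = requires_stable_pages


class Page:
    """Toy page that records how often a writer had to wait."""

    def __init__(self, under_writeback):
        self.under_writeback = under_writeback
        self.waits = 0

    def wait_on_writeback(self):
        self.waits += 1
        self.under_writeback = False


def wait_for_stable_page(page, bdi):
    """Block the writer only if the device needs pages to stay stable."""
    if not bdi.requires_stable_pages:
        return  # device does not care: never wait
    if page.under_writeback:
        page.wait_on_writeback()
```

    Every path that makes a page writable would call the one helper, so
    the decision lives in a single place instead of in each filesystem.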
     

13 Feb, 2013

13 commits


02 Feb, 2013

2 commits

  • This patch allocates a block reservation structure before growing
    or shrinking a file. Without this structure, the grow or shrink code
    can reference a bad pointer.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • The intent here is to split the processing of the glock lru
    list into two parts, so that the selection of glocks and the
    disposal are separate functions. The plan is that further
    updates can then be made to these functions in the future
    to improve the selection of glocks and also the efficiency of
    glock disposal.

    The new feature which this patch brings is sorting the
    glocks to be disposed of into glock number (and thus also
    disk block number) order. Not all glocks will need i/o in
    order to dispose of them, but some will, and at least we'll
    generate mostly disk block order i/o now.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
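
    The split into selection and disposal, with sorting in between, can
    be sketched like this. The Glock fields and both function names are
    hypothetical; the sketch assumes the LRU list is ordered oldest
    first.

```python
from dataclasses import dataclass


@dataclass
class Glock:
    number: int     # glock number, which also reflects the disk block
    needs_io: bool  # whether disposal has to write anything out


def select_glocks(lru, nr):
    """Selection step: take up to nr of the least recently used glocks."""
    return lru[:nr]


def disposal_order(selected):
    """Disposal step: process the chosen glocks in glock number order."""
    return sorted(selected, key=lambda gl: gl.number)
```

    Only some of the selected glocks need i/o, but since all of them
    are visited in ascending number order, the i/o that is issued comes
    out in mostly ascending disk block order.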
     

29 Jan, 2013

2 commits

  • Instead of using a list of buffers to write ahead of the journal
    flush, this now uses a list of inodes and calls ->writepages
    via filemap_fdatawrite() in order to achieve the same thing. For
    most use cases this results in a shorter ordered write list,
    as well as much larger i/os being issued.

    The ordered write list is sorted by inode number before writing
    in order to retain the disk block ordering between inodes as
    per the previous code.

    The previous ordered write code conflicted with mpage_writepages()
    in its assumptions about how to write out the disk blocks. With
    this updated version we can also use mpage_writepages() for GFS2's
    ordered write, writepages implementation. So we will also send
    larger i/os from writeback too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
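
    Why a list of inodes tends to be shorter than a list of buffers can
    be seen in a small sketch. The (inode number, block) pair
    representation and the function name are assumptions made for this
    example.

```python
def ordered_write_list(dirty_buffers):
    """Collapse dirty (inode number, block) pairs into a sorted inode list."""
    return sorted({ino for ino, _block in dirty_buffers})
```

    Five dirty buffers spread over three inodes yield a list of three
    entries, each of which would then be written with one writepages
    call covering all of that inode's dirty pages.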
     
  • The freeze code has not been looked at a lot recently. Upstream has
    moved on, and this is an attempt to catch us back up again. There
    is a vfs level interface for the freeze code which can be called
    from our (obsolete, but kept for backward compatibility purposes)
    sysfs freeze interface. This means freezing this way vs. doing it
    from the ioctl should now work in identical fashion.

    As a result of this, the freeze function is only called once
    and we can drop our own special purpose code for counting the
    number of freezes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse