18 Jan, 2012

16 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit: (29 commits)
    audit: no leading space in audit_log_d_path prefix
    audit: treat s_id as an untrusted string
    audit: fix signedness bug in audit_log_execve_info()
    audit: comparison on interprocess fields
    audit: implement all object interfield comparisons
    audit: allow interfield comparison between gid and ogid
    audit: complex interfield comparison helper
    audit: allow interfield comparison in audit rules
    Kernel: Audit Support For The ARM Platform
    audit: do not call audit_getname on error
    audit: only allow tasks to set their loginuid if it is -1
    audit: remove task argument to audit_set_loginuid
    audit: allow audit matching on inode gid
    audit: allow matching on obj_uid
    audit: remove audit_finish_fork as it can't be called
    audit: reject entry,always rules
    audit: inline audit_free to simplify the look of generic code
    audit: drop audit_set_macxattr as it doesn't do anything
    audit: inline checks for not needing to collect aux records
    audit: drop some potentially inadvisable likely notations
    ...

    Use evil merge to fix up grammar mistakes in Kconfig file.

    Bad speling and horrible grammar (and copious swearing) is to be
    expected, but let's keep it to commit messages and comments, rather than
    expose it to users in config help texts or printouts.

    Linus Torvalds
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: cleanup xfs_file_aio_write
    xfs: always return with the iolock held from xfs_file_aio_write_checks
    xfs: remove the i_new_size field in struct xfs_inode
    xfs: remove the i_size field in struct xfs_inode
    xfs: replace i_pin_wait with a bit waitqueue
    xfs: replace i_flock with a sleeping bitlock
    xfs: make i_flags an unsigned long
    xfs: remove the if_ext_max field in struct xfs_ifork
    xfs: remove the unused dm_attrs structure
    xfs: cleanup xfs_iomap_eof_align_last_fsb
    xfs: remove xfs_itruncate_data

    Linus Torvalds
     
  • * 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    btrfs: take allocation of ->tree_root into open_ctree()
    btrfs: let ->s_fs_info point to fs_info, not root...
    btrfs: consolidate failure exits in btrfs_mount() a bit
    btrfs: make free_fs_info() call ->kill_sb() unconditional
    btrfs: merge free_fs_info() calls on fill_super failures
    btrfs: kill pointless reassignment of ->s_fs_info in btrfs_fill_super()
    btrfs: make open_ctree() return int
    btrfs: sanitizing ->fs_info, part 5
    btrfs: sanitizing ->fs_info, part 4
    btrfs: sanitizing ->fs_info, part 3
    btrfs: sanitizing ->fs_info, part 2
    btrfs: sanitizing ->fs_info, part 1
    btrfs: fix a deadlock in btrfs_scan_one_device()
    btrfs: fix mount/umount race
    btrfs: get ->kill_sb() of its own
    btrfs: preparation to fixing mount/umount race

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
    Btrfs: use larger system chunks
    Btrfs: add a delalloc mutex to inodes for delalloc reservations
    Btrfs: space leak tracepoints
    Btrfs: protect orphan block rsv with spin_lock
    Btrfs: add allocator tracepoints
    Btrfs: don't call btrfs_throttle in file write
    Btrfs: release space on error in page_mkwrite
    Btrfs: fix btrfsck error 400 when truncating a compressed
    Btrfs: do not use btrfs_end_transaction_throttle everywhere
    Btrfs: add balance progress reporting
    Btrfs: allow for resuming restriper after it was paused
    Btrfs: allow for canceling restriper
    Btrfs: allow for pausing restriper
    Btrfs: add skip_balance mount option
    Btrfs: recover balance on mount
    Btrfs: save balance parameters to disk
    Btrfs: soft profile changing mode (aka soft convert)
    Btrfs: implement online profile changing
    Btrfs: do not reduce profile in do_chunk_alloc()
    Btrfs: virtual address space subset filter
    ...

    Fix up trivial conflict in fs/btrfs/ioctl.c due to the use of the new
    mnt_drop_write_file() helper.

    Linus Torvalds
     
  • Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
    robust, and it also doesn't match the permission checking of any of the
    other related files.

    This changes it to do the permission checks at open time, and instead of
    tracking the process, it tracks the VM at the time of the open. That
    simplifies the code a lot, but does mean that if you hold the file
    descriptor open over an execve(), you'll continue to read from the _old_
    VM.

    That is different from our previous behavior, but much simpler. If
    somebody actually finds a load where this matters, we'll need to revert
    this commit.

    I suspect that nobody will ever notice - because the process mapping
    addresses will also have changed as part of the execve. So you cannot
    actually usefully access the fd across a VM change simply because all
    the offsets for IO would have changed too.

    Reported-by: Jüri Aedla
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Just a code cleanup really. We don't need to make a function call just for
    it to return on error. This also makes the VFS function even easier to follow
    and removes a conditional on a hot path.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • At the moment we allow tasks to set their loginuid if they have
    CAP_AUDIT_CONTROL. In reality we want tasks to set the loginuid when they
    log in and for it to be impossible to ever reset. We had to make it mutable
    even after it was once set (with the CAP) because on update an admin might
    have to restart sshd. Now sshd would get his loginuid and the next user who
    logged in using ssh would not be able to set his loginuid.

    Systemd has changed how userspace works and allowed us to make the kernel
    work the way it should. With systemd users (even admins) are not supposed
    to restart services directly. The system will restart the service for
    them. Thus since systemd is going to have loginuid==-1, sshd would get -1, and
    sshd would be allowed to set a new loginuid without special permissions.

    If an admin in this system were to manually start an sshd he is inserting
    himself into the system chain of trust and thus, logically, it's his
    loginuid that should be used! Since we have old systems I make this a
    Kconfig option.

    Signed-off-by: Eric Paris

    Eric Paris
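
    The policy described in this message can be modeled in user space roughly
    as follows. This is only a sketch: `task_model`, `model_set_loginuid()` and
    the `immutable` flag (standing in for the Kconfig option) are hypothetical
    names chosen for illustration, not the kernel's actual code.

```c
#include <errno.h>
#include <stdbool.h>

/* An unset loginuid is represented as -1, as in the commit message. */
#define LOGINUID_UNSET ((unsigned int)-1)

/* Hypothetical stand-in for the parts of a task this policy looks at. */
struct task_model {
    unsigned int loginuid;
    bool has_audit_control;   /* models holding CAP_AUDIT_CONTROL */
};

/*
 * immutable == true  : new behaviour, loginuid may only be set while unset.
 * immutable == false : old behaviour, CAP_AUDIT_CONTROL is required.
 */
static int model_set_loginuid(struct task_model *t, unsigned int new_uid,
                              bool immutable)
{
    if (immutable) {
        if (t->loginuid != LOGINUID_UNSET)
            return -EPERM;
    } else {
        if (!t->has_audit_control)
            return -EPERM;
    }
    t->loginuid = new_uid;
    return 0;
}
```

    Under this model an sshd started with loginuid -1 can set a loginuid once
    without any capability, and any later attempt to change it fails.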
     
  • The function always deals with current. Don't expose an option
    pretending one can use it for something. You can't.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • With all the size field updates out of the way xfs_file_aio_write can
    be further simplified by pushing all iolock handling into
    xfs_file_dio_aio_write and xfs_file_buffered_aio_write and using
    the generic generic_write_sync helper for synchronous writes.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • While xfs_iunlock is fine with 0 lockflags the calling conventions are much
    cleaner if xfs_file_aio_write_checks never returns without the iolock held.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Now that we use the VFS i_size field throughout XFS there is no need for the
    i_new_size field any more given that the VFS i_size field gets updated
    in ->write_end before unlocking the page, and thus is always uptodate when
    writeback could see a page. Removing i_new_size also has the advantage that
    we will never have to trim back di_size during a failed buffered write,
    given that it never gets updated past i_size.

    Note that currently the generic direct I/O code only updates i_size after
    calling our end_io handler, which requires a small workaround to make
    sure di_size actually makes it to disk. I hope to fix this properly in
    the generic code.

    A downside is that we lose the support for parallel non-overlapping O_DIRECT
    appending writes that recently was added. I don't think keeping the complex
    and fragile i_new_size infrastructure for this is a good tradeoff - if we
    really care about parallel appending writers we should investigate turning
    the iolock into a range lock, which would also allow for parallel
    non-overlapping buffered writers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • There is no fundamental need to keep an in-memory inode size copy in the XFS
    inode. We already have the on-disk value in the dinode, and the separate
    in-memory copy that we need for regular files only in the XFS inode.

    Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
    VFS inode i_size field for regular files. Switch code that was directly
    accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
    we are limited to regular files direct access of the VFS inode i_size field.

    This also allows dropping some fairly complicated code in the write path
    which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
    that is getting updated inside ->write_end.

    Note that we do not bother resetting the VFS i_size when truncating a file
    that gets freed to zero as there is no point in doing so because the VFS inode
    is no longer in use at this point. Just relax the assert in xfs_ifree to
    only check the on-disk size instead.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Replace i_pin_wait, which is only used during synchronous inode flushing
    with a bit waitqueue. This trades off a much smaller inode against
    slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit)
    bytes in the XFS inode.

    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • We almost never block on i_flock, the exception is synchronous inode
    flushing. Instead of bloating the inode with a 16/24-byte completion
    that we abuse as a semaphore just implement it as a bitlock that uses
    a bit waitqueue for the rare sleeping path. This primarily is a
    tradeoff between a much smaller inode and a faster non-blocking
    path vs faster wakeups, and we are much better off with the former.

    A small downside is that we will lose lockdep checking for i_flock, but
    given that it's always taken inside the ilock that should be acceptable.

    Note that for example the inode writeback locking is implemented in a
    very similar way.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Ben Myers

    Christoph Hellwig
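
    The non-blocking side of such a bitlock can be sketched with C11 atomics.
    This is a user-space illustration only: `inode_model`, `FLOCK_BIT_MODEL`
    and the function names are assumptions made for the example, and the
    sleeping path (the bit waitqueue) is only indicated by a comment.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLOCK_BIT_MODEL 0   /* hypothetical bit index inside the flags word */

struct inode_model {
    atomic_ulong flags;     /* one word shared by all flag bits */
};

static void inode_model_init(struct inode_model *ip)
{
    atomic_init(&ip->flags, 0UL);
}

/* Try to take the lock bit; returns true if we set it, false if it was held. */
static bool flock_trylock_model(struct inode_model *ip)
{
    unsigned long mask = 1UL << FLOCK_BIT_MODEL;
    unsigned long old = atomic_fetch_or_explicit(&ip->flags, mask,
                                                 memory_order_acquire);
    return (old & mask) == 0;
}

/* Clear the lock bit; a kernel version would also wake any bit waiters here. */
static void flock_unlock_model(struct inode_model *ip)
{
    unsigned long mask = 1UL << FLOCK_BIT_MODEL;
    atomic_fetch_and_explicit(&ip->flags, ~mask, memory_order_release);
}
```

    The lock costs a single bit in a word the inode already carries, which is
    where the size saving over an embedded completion comes from.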
     
  • To be used for bit wakeup i_flags needs to be an unsigned long or we'll
    run into trouble on big endian systems. Because of the 1-byte i_update
    field right after it this actually causes a fairly large size increase
    on its own (4 or 8 bytes), but that increase will be more than offset
    by the next two patches.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • We spend a lot of effort maintaining this field, but it is always equal to the
    fork size divided by the constant size of an extent. The prime use of it is
    to assert that the two stay in sync. Just divide the fork size by the extent
    size in the few places that we actually use it and remove the overhead
    of maintaining it. Also introduce a few helpers to consolidate the places
    where we actually care about the value.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
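
    The replacement computation is a plain division. A minimal sketch, assuming
    a 16-byte extent record (two 64-bit words); `extent_rec_model` and
    `fork_max_extents()` are illustrative names, not the helpers this commit
    introduces.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a fixed-size extent record: two 64-bit words, 16 bytes. */
struct extent_rec_model {
    uint64_t l0;
    uint64_t l1;
};

/* Number of whole extent records that fit in a fork of fork_bytes bytes. */
static size_t fork_max_extents(size_t fork_bytes)
{
    return fork_bytes / sizeof(struct extent_rec_model);
}
```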
     

17 Jan, 2012

24 commits

  • NFS client bugfixes and cleanups for Linux 3.3 (pull 2)

    * tag 'nfs-for-3.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    pnfsblock: alloc short extent before submit bio
    pnfsblock: remove rpc_call_ops from struct parallel_io
    pnfsblock: move find lock page logic out of bl_write_pagelist
    pnfsblock: cleanup bl_mark_sectors_init
    pnfsblock: limit bio page count
    pnfsblock: don't spinlock when freeing block_dev
    pnfsblock: clean up _add_entry
    pnfsblock: set read/write tk_status to pnfs_error
    pnfsblock: acquire im_lock in _preload_range
    NFS4: fix compile warnings in nfs4proc.c
    nfs: check for integer overflow in decode_devicenotify_args()
    NFS: cleanup endian type in decode_ds_addr()
    NFS: add an endian notation

    Linus Torvalds
     
  • System chunks by default are very small. This makes them slightly
    larger and also fixes the conditional checks to make sure we don't
    allocate a billion of them at once.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
    that and there's no real way to get rid of those, so just stop using i_mutex to
    protect delalloc metadata reservations and use a delalloc mutex instead. This
    shouldn't be contended often at all, only if you are writing and mmap writing to
    the file at the same time. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This in addition to a script in my btrfs-tracing tree will help track down space
    leaks when we're getting space left over in block groups on umount. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We've been seeing warnings coming out of the orphan commit stuff forever from
    ceph. Turns out it's because we're racing with checking if the orphan block
    reserve is set, because we clear it outside of the spin_lock. So leave the
    normal fastpath checks where they are, but take the spin_lock and _recheck_ to
    make sure we haven't had an orphan block rsv added in the meantime. Then clear
    the root's orphan block rsv and release the lock. With this patch a user said
    the warnings went away and they usually showed up pretty soon after he started
    ceph. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • I used these tracepoints when figuring out what the cluster stuff was doing, so
    add them to mainline in case we need to profile this stuff again. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Btrfs_throttle will make us wait if there is a currently committing transaction
    until we can open new transactions, which is ridiculous since we don't actually
    start any transactions within the file write path anyway, so all this does is
    introduce big latencies if we have a sync/fsync heavy workload going on while
    somebody else is trying to do work. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
    which is a problem since we make our reservation right before trying to update
    the inode, so fix the out label so that we actually free our reservation.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Reproduce steps:
    # mkfs.btrfs /dev/sdb5
    # mount /dev/sdb5 -o compress=lzo /mnt
    # dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
    # sync
    # truncate -s 64K /mnt/tmpfile
    root 5 inode 257 errors 400

    This is because of a wrong if condition, which is used to check whether we should
    subtract the bytes of the dropped range from the i_blocks/i_bytes of the inode.
    When we truncate a compressed extent, btrfs subtracts the bytes of the whole
    extent, which is wrong. We should subtract the size that we actually truncate,
    whether or not it is a compressed extent. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
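
    The intended accounting can be illustrated with a small helper. This is a
    sketch of the arithmetic only, not the btrfs code; `bytes_dropped()` and
    its parameters are hypothetical names.

```c
#include <stdint.h>

/*
 * Bytes to subtract from the inode's byte count when a file is truncated to
 * new_size and the extent covers [extent_start, extent_start + extent_len).
 * The same rule applies whether or not the extent is compressed.
 */
static uint64_t bytes_dropped(uint64_t extent_start, uint64_t extent_len,
                              uint64_t new_size)
{
    uint64_t extent_end = extent_start + extent_len;

    if (new_size >= extent_end)
        return 0;                 /* extent lies entirely below the new size */
    if (new_size <= extent_start)
        return extent_len;        /* extent is dropped completely */
    return extent_end - new_size; /* only the truncated tail is dropped */
}
```

    With the numbers from the reproduce steps, a 128K extent at offset 0
    truncated to 64K drops 64K, not the full 128K.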
     
  • A user reported a problem where things like open with O_CREAT would take up to
    30 seconds when he had nfs activity on the same mount. This is because all of
    our quick metadata operations, like create, symlink etc all do
    btrfs_end_transaction_throttle, which if the transaction is blocked will wait
    for the commit to complete before it returns. This adds a ridiculous amount of
    latency and isn't really needed. The normal btrfs_end_transaction will mark the
    transaction as blocked and wake the transaction kthread up if it thinks the
    transaction needs to end (this being in the running out of global reserve space
    scenario), and this is all that is really needed since we've already done
    everything we're going to do, we just need to return. This should help people
    with the latency they were seeing when using synchronous heavy workloads.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Conflicts:
    fs/btrfs/ctree.h
    fs/btrfs/super.c

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Chris Mason
     
  • Conflicts:
    fs/btrfs/volumes.c

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Chris Mason
     
  • Chris Mason
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Recognize BTRFS_BALANCE_RESUME flag passed from userspace. We use the
    same heuristics used when recovering balance after a crash to try to
    start where we left off last time.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Implement an ioctl for canceling restriper. Currently we wait until
    relocation of the current block group is finished, in future this can be
    done by triggering a commit. Balance item is deleted and no memory
    about the interrupted balance is kept.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Implement an ioctl for pausing restriper. This pauses the relocation,
    but balance is still considered to be "in progress": balance item is
    not deleted, other volume operations cannot be started, etc. If paused
    in the middle of profile changing operation we will continue making
    allocations with the target profile.

    Add a hook to close_ctree() to pause restriper and free its data
    structures on unmount. (It's safe to unmount when restriper is in
    "paused" state, we will resume with the same parameters on the next
    mount)

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Since restriper kthread starts involuntarily on mount and can suck cpu
    and memory bandwidth add a mount option to forcefully skip it. The
    restriper in that case hangs around in paused state and can be resumed
    from userspace when it's convenient.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • On mount, if balance item is found, resume balance in a separate
    kernel thread.

    Try to be smart to continue roughly where previous balance (or convert)
    was interrupted. For chunk types that were being converted to some
    profile we turn on soft convert, in case of a simple balance we turn on
    usage filter and relocate only less-than-90%-full chunks of that type.
    These are just heuristics but they help quite a bit, and can be improved
    in future.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Introduce a new btree objectid for storing balance item. The reason is
    to be able to resume restriper after a crash with the same parameters.
    Balance item has a very high objectid and goes into tree of tree roots.

    The key for the new item is as follows:

    [ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]

    Older kernels simply ignore it so it's safe to mount with an older
    kernel and then go back to the newer one.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
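
    To show why a very high objectid puts the item at the end of the tree,
    here is a sketch of a key triple with the usual (objectid, type, offset)
    ordering. The struct name and both numeric constants are placeholders
    picked for the example; they are not claimed to be the values the kernel
    defines.

```c
#include <stdint.h>

/* Model of an (objectid, type, offset) key. */
struct key_model {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* Placeholder values for illustration only. */
#define MODEL_BALANCE_OBJECTID (UINT64_MAX - 3)  /* "very high" objectid */
#define MODEL_BALANCE_ITEM_KEY 200

/* Compare keys field by field: objectid first, then type, then offset. */
static int key_cmp(const struct key_model *a, const struct key_model *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Build the [ objectid ; type ; 0 ] key described in the commit message. */
static struct key_model balance_key_model(void)
{
    struct key_model k = { MODEL_BALANCE_OBJECTID, MODEL_BALANCE_ITEM_KEY, 0 };
    return k;
}
```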
     
  • When doing convert from one profile to another if soft mode is on
    restriper won't touch chunks that already have the profile we are
    converting to. This is useful if e.g. half of the FS was converted
    earlier.

    The soft mode switch is (like every other filter) per-type. This means
    that we can convert for example meta chunks the "hard" way while
    converting data chunks selectively with soft switch.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
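
    The soft-mode decision amounts to a per-type filter. A minimal sketch,
    assuming each chunk type carries its own arguments; `balance_args_model`,
    its fields and `chunk_needs_convert()` are hypothetical names, and
    profiles are represented as opaque integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-type arguments: one instance each for data, meta, system. */
struct balance_args_model {
    bool     convert;   /* profile conversion requested for this type */
    bool     soft;      /* soft mode: skip chunks already at the target */
    uint64_t target;    /* target profile */
};

/* Decide whether a chunk of this type should be relocated for conversion. */
static bool chunk_needs_convert(const struct balance_args_model *args,
                                uint64_t chunk_profile)
{
    if (!args->convert)
        return false;                       /* nothing to convert */
    if (args->soft && chunk_profile == args->target)
        return false;                       /* already converted, leave it */
    return true;
}
```

    Because the arguments are per type, metadata can be converted the "hard"
    way while data chunks use the soft switch, as described above.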
     
  • Profile changing is done by launching a balance with
    BTRFS_BALANCE_CONVERT bits set and target fields of respective
    btrfs_balance_args structs initialized. Profile reducing code in this
    case will pick restriper's target profile if it's available instead of
    doing a blind reduce. If target profile is not yet available it goes
    back to a plain reduce.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov