09 Aug, 2013

3 commits

  • Previous commits released the write lock across quota operations but
    missed several places. In particular, the free operations can also
    call into the file system code and take the write lock, causing
    deadlocks.

    This patch introduces some more helpers and uses them for quota call
    sites. Without this patch applied, reiserfs + quotas runs into deadlocks
    under anything more than trivial load.

    Signed-off-by: Jeff Mahoney

    Jeff Mahoney
     
  • The reiserfs write lock replaced the BKL and uses similar semantics.

    Frederic's locking code makes a distinction between when the lock is nested
    and when it's being acquired/released, but I don't think that's the right
    distinction to make.

    The right distinction is between the lock being released at end-of-use and
    the lock being released for a schedule. The unlock should return the depth
    and the lock should restore it, rather than the other way around as it is now.

    This patch implements that and adds a number of places where the lock
    should be dropped.

    Signed-off-by: Jeff Mahoney

    Jeff Mahoney
     
  • The reiserfs xattr code doesn't need the write lock and sleeps all over
    the place. We can simplify the locking by releasing it and reacquiring
    after the xattr call.

    Signed-off-by: Jeff Mahoney

    Jeff Mahoney
     

03 Jul, 2013

1 commit

  • Pull ext4 update from Ted Ts'o:
    "Lots of bug fixes, cleanups and optimizations. In the bug fixes
    category, of note is a fix for on-line resizing file systems where the
    block size is smaller than the page size (i.e., file systems 1k blocks
    on x86, or more interestingly file systems with 4k blocks on Power or
    ia64 systems.)

    In the cleanup category, the ext4's punch hole implementation was
    significantly improved by Lukas Czerner, and now supports bigalloc
    file systems. In addition, Jan Kara significantly cleaned up the
    write submission code path. We also improved error checking and added
    a few sanity checks.

    In the optimizations category, two major optimizations deserve
    mention. The first is that ext4_writepages() is now used for
    nodelalloc and ext3 compatibility mode. This allows writes to be
    submitted much more efficiently as a single bio request, instead of
    being sent as individual 4k writes into the block layer (which then
    relied on the elevator code to coalesce the requests in the block
    queue). Secondly, the extent cache shrink mechanism, which was
    introduce in 3.9, no longer has a scalability bottleneck caused by the
    i_es_lru spinlock. Other optimizations include some changes to reduce
    CPU usage and to avoid issuing empty commits unnecessarily."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    ext4: optimize starting extent in ext4_ext_rm_leaf()
    jbd2: invalidate handle if jbd2_journal_restart() fails
    ext4: translate flag bits to strings in tracepoints
    ext4: fix up error handling for mpage_map_and_submit_extent()
    jbd2: fix theoretical race in jbd2__journal_restart
    ext4: only zero partial blocks in ext4_zero_partial_blocks()
    ext4: check error return from ext4_write_inline_data_end()
    ext4: delete unnecessary C statements
    ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
    jbd2: move superblock checksum calculation to jbd2_write_superblock()
    ext4: pass inode pointer instead of file pointer to punch hole
    ext4: improve free space calculation for inline_data
    ext4: reduce object size when !CONFIG_PRINTK
    ext4: improve extent cache shrink mechanism to avoid to burn CPU time
    ext4: implement error handling of ext4_mb_new_preallocation()
    ext4: fix corruption when online resizing a fs with 1K block size
    ext4: delete unused variables
    ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
    jbd2: remove debug dependency on debug_fs and update Kconfig help text
    jbd2: use a single printk for jbd_debug()
    ...

    Linus Torvalds
     

01 Jun, 2013

1 commit

  • Reiserfs is currently able to be deadlocked by having two NFS clients
    where one has removed and recreated a file and another is accessing the
    file with an open file handle.

    If one client deletes and recreates a file with timing such that the
    recreated file obtains the same [dirid, objectid] pair as the original
    file while another client accesses the file via file handle, the create
    and lookup can race and deadlock if the lookup manages to create the
    in-memory inode first.

    The create thread, in insert_inode_locked4, will hold the write lock
    while waiting on the other inode to be unlocked. The lookup thread,
    anywhere in the iget path, will release and reacquire the write lock while
    it schedules. If it needs to reacquire the lock while the create thread
    has it, it will never be able to make forward progress because it needs
    to reacquire the lock before ultimately unlocking the inode.

    This patch drops the write lock across the insert_inode_locked4 call so
    that the ordering of inode_wait -> write lock is retained. Since this
    would have been the case before the BKL push-down, this is safe.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Jan Kara

    Jeff Mahoney
     

22 May, 2013

2 commits

  • ->invalidatepage() aop now accepts range to invalidate so we can make
    use of it in reiserfs_invalidatepage()

    Signed-off-by: Lukas Czerner
    Cc: reiserfs-devel@vger.kernel.org

    Lukas Czerner
     
  • Currently there is no way to truncate partial page where the end
    truncate point is not at the end of the page. This is because it was not
    needed and the functionality was enough for file system truncate
    operation to work properly. However more file systems now support punch
    hole feature and it can benefit from mm supporting truncating page just
    up to the certain point.

    Specifically, with this functionality truncate_inode_pages_range() can
    be changed so it supports truncating partial page at the end of the
    range (currently it will BUG_ON() if 'end' is not at the end of the
    page).

    This commit changes the invalidatepage() address space operation
    prototype to accept range to be invalidated and update all the instances
    for it.

    We also change the block_invalidatepage() in the same way and actually
    make a use of the new length argument implementing range invalidation.

    Actual file system implementations will follow except the file systems
    where the changes are really simple and should not change the behaviour
    in any way .Implementation for truncate_page_range() which will be able
    to accept page unaligned ranges will follow as well.

    Signed-off-by: Lukas Czerner
    Cc: Andrew Morton
    Cc: Hugh Dickins

    Lukas Czerner
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

26 Feb, 2013

1 commit


21 Dec, 2012

1 commit


20 Nov, 2012

1 commit


10 Oct, 2012

1 commit

  • Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
    u64 inum = fid->raw[2];
    which is unhelpfully reported as at the end of shmem_alloc_inode():

    BUG: unable to handle kernel paging request at ffff880061cd3000
    IP: [] shmem_alloc_inode+0x40/0x40
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Call Trace:
    [] ? exportfs_decode_fh+0x79/0x2d0
    [] do_handle_open+0x163/0x2c0
    [] sys_open_by_handle_at+0xc/0x10
    [] tracesys+0xe1/0xe6

    Right, tmpfs is being stupid to access fid->raw[2] before validating that
    fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
    fall at the end of a page, and the next page not be present.

    But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
    careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
    could oops in the same way: add the missing fh_len checks to those.

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: Al Viro
    Cc: Sage Weil
    Cc: Steven Whitehouse
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Hugh Dickins
     

03 Oct, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values are far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting compile type incompatible compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privlige checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user names to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/git logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     

21 Sep, 2012

1 commit


15 Aug, 2012

1 commit

  • The BKL push-down for reiserfs made lock recursion a special case that needs
    to be handled explicitly. One of the cases that was unhandled is dropping
    the quota during inode eviction. Both reiserfs_evict_inode and
    reiserfs_write_dquot take the write lock, but when the journal lock is
    taken it only drops one the references. The locking rules are that the journal
    lock be acquired before the write lock so leaving the reference open leads
    to a ABBA deadlock.

    This patch pushes the unlock up before clear_inode and avoids the recursive
    locking.

    Another ABBA situation can occur when the write lock is dropped while reading
    the bitmap buffer while in the quota code. When the lock is reacquired, it
    will deadlock against dquot->dq_lock and dqopt->dqio_mutex in the dquot_acquire
    path. It's safe to retain the lock across the read and should be cached under
    write load.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Jan Kara

    Jeff Mahoney
     

30 May, 2012

1 commit

  • pass inode + parent's inode or NULL instead of dentry + bool saying
    whether we want the parent or not.

    NOTE: that needs ceph fix folded in.

    Signed-off-by: Al Viro

    Al Viro
     

06 May, 2012

1 commit

  • After we moved inode_sync_wait() from end_writeback() it doesn't make sense
    to call the function end_writeback() anymore. Rename it to clear_inode()
    which well says what the function really does - set I_CLEAR flag.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

21 Mar, 2012

3 commits


04 Jan, 2012

1 commit


02 Nov, 2011

2 commits


26 Jul, 2011

1 commit


21 Jul, 2011

2 commits

  • Simple filesystems always pass inode->i_sb_bdev as the block device
    argument, and never need a end_io handler. Let's simply things for
    them and for my grepping activity by dropping these arguments. The
    only thing not falling into that scheme is ext4, which passes and
    end_io handler without needing special flags (yet), but given how
    messy the direct I/O code there is use of __blockdev_direct_IO
    in one instead of two out of three cases isn't going to make a large
    difference anyway.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Let filesystems handle waiting for direct I/O requests themselves instead
    of doing it beforehand. This means filesystem-specific locks to prevent
    new dio referenes from appearing can be held. This is important to allow
    generalizing i_dio_count to non-DIO_LOCKING filesystems.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

14 Mar, 2011

1 commit

  • The exportfs encode handle function should return the minimum required
    handle size. This helps user to find out the handle size by passing 0
    handle size in the first step and then redoing to the call again with
    the returned handle size value.

    Acked-by: Serge Hallyn
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Al Viro

    Aneesh Kumar K.V
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Nov, 2010

1 commit


27 Oct, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • This removes more dead code that was somehow missed by commit 0d99519efef
    (writeback: remove unused nonblocking and congestion checks). There are
    no behavior change except for the removal of two entries from one of the
    ext4 tracing interface.

    The nonblocking checks in ->writepages are no longer used because the
    flusher now prefer to block on get_request_wait() than to skip inodes on
    IO congestion. The latter will lead to more seeky IO.

    The nonblocking checks in ->writepage are no longer used because it's
    redundant with the WB_SYNC_NONE check.

    We no long set ->nonblocking in VM page out and page migration, because
    a) it's effectively redundant with WB_SYNC_NONE in current code
    b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
    that would skip some dirty inodes on congestion and page out others, which
    is unfair in terms of LRU age.

    Inspired by Christoph Hellwig. Thanks!

    Signed-off-by: Wu Fengguang
    Cc: Theodore Ts'o
    Cc: David Howells
    Cc: Sage Weil
    Cc: Steve French
    Cc: Chris Mason
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

26 Oct, 2010

1 commit

  • __block_write_begin and block_prepare_write are identical except for slightly
    different calling conventions. Convert all callers to the __block_write_begin
    calling conventions and drop block_prepare_write.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

18 Aug, 2010

1 commit


11 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

5 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Make sure we call inode_change_ok before doing any changes in ->setattr,
    and make sure to call it even if our fs wants to ignore normal UNIX
    permissions, but use the ATTR_FORCE to skip those.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Replace inode_setattr with opencoded variants of it in all callers. This
    moves the remaining call to vmtruncate into the filesystem methods where it
    can be replaced with the proper truncate sequence.

    In a few cases it was obvious that we would never end up calling vmtruncate
    so it was left out in the opencoded variant:

    spufs: explicitly checks for ATTR_SIZE earlier
    btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
    ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

    In addition to that ncpfs called inode_setattr with handcrafted iattrs,
    which allowed to trim down the opencoded variant.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Split up the block_write_begin implementation - __block_write_begin is a new
    trivial wrapper for block_prepare_write that always takes an already
    allocated page and can be either called from block_write_begin or filesystem
    code that already has a page allocated. Remove the handling of already
    allocated pages from block_write_begin after switching all callers that
    do it to __block_write_begin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Move the call to vmtruncate to get rid of accessive blocks to the callers
    in prepearation of the new truncate calling sequence. This was only done
    for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
    was not needed anyway. Get rid of blockdev_direct_IO_no_locking and
    its _newtrunc variant while at it as just opencoding the two additional
    paramters is shorted than the name suffix.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig