20 Mar, 2014

1 commit


18 Jan, 2014

1 commit

  • Pull namespace fixes from Eric Biederman:
    "This is a set of 3 regression fixes.

    This fixes /proc/mounts when using "ip netns add " to display
    the actual mount point.

    This fixes a regression in clone that broke lxc-attach.

    This fixes a regression in the permission checks for mounting /proc
    that made proc unmountable if binfmt_misc was in use. Oops.

    My apologies for sending this pull request so late. Al Viro gave
    interesting review comments about the d_path fix that I wanted to
    address in detail before I sent this pull request. Unfortunately a
    bad round of colds kept me from addressing that in detail until today.
    The executive summary of the review was:

    Al: Is patching d_path really sufficient?
    The prepend_path, d_path, d_absolute_path, and __d_path family of
    functions is a real mess.

    Me: Yes, patching d_path is really sufficient. Yes, the code is a mess.
    No it is not appropriate to rewrite all of d_path for a regression
    that has existed for entirely too long already, when a two line
    change will do"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Fix a regression in mounting proc
    fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
    vfs: In d_path don't call d_dname on a mount point

    Linus Torvalds
     

16 Jan, 2014

1 commit


15 Jan, 2014

1 commit

  • There is a bug in the function nilfs_segctor_collect that results in
    active data being written to a segment that is marked as clean. This
    segment may then be selected for a later segment construction, in
    which case the old data is overwritten.

    The problem shows itself with the following kernel log message:

    nilfs_sufile_do_cancel_free: segment 6533 must be clean

    Usually a few hours later the file system gets corrupted:

    NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
    NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)

    The issue can be reproduced with a file system that is nearly full and
    with the cleaner running while some IO-intensive task is running,
    although it is quite hard to reproduce.

    This is what happens:

    1. The cleaner starts the segment construction
    2. nilfs_segctor_collect is called
    3. sc_stage is on NILFS_ST_SUFILE and segments are freed
    4. sc_stage is on NILFS_ST_DAT and the current segment is full
    5. nilfs_segctor_extend_segments is called, which
    allocates a new segment
    6. The new segment is one of the segments freed in step 3
    7. nilfs_sufile_cancel_freev is called and produces an error message
    8. Loop around and the collection starts again
    9. sc_stage is on NILFS_ST_SUFILE and segments are freed
    including the newly allocated segment, which will contain active
    data and can be allocated at a later time
    10. A few hours later another segment construction allocates the
    segment and causes file system corruption

    This can be prevented by simply reordering the statements. If
    nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
    the freed segments are marked as dirty and cannot be allocated any more.
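
    The ordering problem in the steps above can be sketched with a toy
    model. This is only an illustration under stated assumptions: the
    class and function names (SegmentUsage, retry_collection) are
    hypothetical, and the allocator is reduced to "pick the first clean
    segment", which is not how NILFS chooses segments.

```python
CLEAN, DIRTY = "clean", "dirty"


class SegmentUsage:
    """Hypothetical stand-in for the segment usage file."""

    def __init__(self, nsegments, clean=()):
        self.state = [CLEAN if n in clean else DIRTY for n in range(nsegments)]

    def free(self, segnums):
        """Mark segments clean, which makes them allocatable."""
        for n in segnums:
            self.state[n] = CLEAN

    def cancel_free(self, segnums):
        """Undo free(); return the segments that were unexpectedly not clean."""
        errors = []
        for n in segnums:
            if self.state[n] == CLEAN:
                self.state[n] = DIRTY
            else:
                errors.append(n)  # models "segment N must be clean"
        return errors

    def alloc(self):
        """Allocate the first clean segment and mark it dirty."""
        n = self.state.index(CLEAN)
        self.state[n] = DIRTY
        return n


def retry_collection(sufile, freed, cancel_first):
    """One collection pass that runs out of space and has to be restarted."""
    sufile.free(freed)                      # step 3
    if cancel_first:                        # order after the fix
        errors = sufile.cancel_free(freed)
        new_segment = sufile.alloc()
    else:                                   # order before the fix
        new_segment = sufile.alloc()        # steps 5 and 6
        errors = sufile.cancel_free(freed)  # step 7
    return new_segment, errors
```

    With segments 0 and 1 freed and segment 3 already clean, the old
    order hands out segment 0 and reports it as an error, while the
    reordered version allocates segment 3 and reports nothing.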

    Signed-off-by: Andreas Rohner
    Reviewed-by: Ryusuke Konishi
    Tested-by: Andreas Rohner
    Signed-off-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Rohner
     

11 Jan, 2014

3 commits

  • Pull xfs bugfixes from Ben Myers:
    "Here we have a bugfix for an off-by-one in the remote attribute
    verifier that results in a forced shutdown which you can hit with v5
    superblock by creating a 64k xattr, and a fix for a missing
    destroy_work_on_stack() in the allocation worker.

    It's a bit late, but they are both fairly straightforward"

    * tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs:
    xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
    xfs: fix off-by-one error in xfs_attr3_rmt_verify

    Linus Torvalds
     
  • If CONFIG_DEBUG_OBJECTS_WORK is defined, destroy_work_on_stack(),
    which frees the debug object, must be called to pair with
    INIT_WORK_ONSTACK().

    Signed-off-by: Liu, Chuansheng
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    (cherry picked from commit 6f96b3063cdd473c68664a190524ed966ac0cd92)

    Chuansheng Liu
     
  • With CRC checks enabled, setting an attribute value whose size is
    exactly XATTR_SIZE_MAX causes the v3 remote attr write verification
    to fail, which yields a back trace like the one below:

    XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c

    Call Trace:
    [] dump_stack+0x45/0x56
    [] xfs_error_report+0x3b/0x40 [xfs]
    [] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
    [] xfs_corruption_error+0x55/0x80 [xfs]
    [] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs]
    [] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
    [] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
    [] _xfs_buf_ioapply+0x6d/0x390 [xfs]
    [] ? vm_map_ram+0x31a/0x460
    [] ? wake_up_state+0x20/0x20
    [] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
    [] xfs_buf_iorequest+0x6b/0xc0 [xfs]
    [] xfs_bdstrat_cb+0x55/0xb0 [xfs]
    [] xfs_bwrite+0x46/0x80 [xfs]
    [] xfs_attr_rmtval_set+0x334/0x490 [xfs]
    [] xfs_attr_leaf_addname+0x24a/0x410 [xfs]
    [] xfs_attr_set_int+0x223/0x470 [xfs]
    [] xfs_attr_set+0x96/0xb0 [xfs]
    [] xfs_xattr_set+0x42/0x70 [xfs]
    [] generic_setxattr+0x62/0x80
    [] __vfs_setxattr_noperm+0x63/0x1b0
    [] ? evm_inode_setxattr+0xe/0x10
    [] vfs_setxattr+0xb5/0xc0
    [] setxattr+0x12e/0x1c0
    [] ? final_putname+0x22/0x50
    [] ? putname+0x2b/0x40
    [] ? user_path_at_empty+0x5f/0x90
    [] ? __sb_start_write+0x49/0xe0
    [] ? vm_mmap_pgoff+0x99/0xc0
    [] SyS_setxattr+0x8f/0xe0
    [] system_call_fastpath+0x1a/0x1f

    Tests:
    setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile

    This patch fixes it by checking whether the remote EA size is greater
    than XATTR_SIZE_MAX rather than greater than or equal to it, because
    an EA value size equal to the limit is valid as per the VFS setxattr
    interface.
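
    The boundary condition can be illustrated with a minimal sketch. The
    two function names are hypothetical and the real verifier checks more
    than the length; only the comparison operator is modelled here.

```python
XATTR_SIZE_MAX = 65536  # VFS limit on an extended attribute value, in bytes


def size_ok_before_fix(value_len):
    """Models the off-by-one: a value of exactly XATTR_SIZE_MAX is rejected."""
    return not (value_len >= XATTR_SIZE_MAX)


def size_ok_after_fix(value_len):
    """Models the corrected check: only sizes above the limit are invalid."""
    return not (value_len > XATTR_SIZE_MAX)
```

    A 65536-byte value, as produced by the setfattr test above, fails the
    first check and passes the second.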

    Signed-off-by: Jie Liu
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    (cherry picked from commit 85dd0707f0cad26d60f2dc574d17a5ab948d10f7)

    Jie Liu
     

07 Jan, 2014

2 commits

  • Pull ext4 bugfix from Ted Ts'o:
    "Fix a regression introduced in v3.13-rc6"

    * tag 'ext4_for_linus_stable' of http://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix bigalloc regression

    Linus Torvalds
     
  • Commit f5a44db5d2 introduced a regression on filesystems created with
    the bigalloc feature (cluster size > blocksize). It causes xfstests
    generic/006 and /013 to fail with an unexpected JBD2 failure and
    transaction abort that leaves the test file system in a read only state.
    Other xfstests run on bigalloc file systems are likely to fail as well.

    The cause is the accidental use of a cluster mask where a cluster
    offset was needed in ext4_ext_map_blocks().
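
    The difference between a cluster mask and a cluster offset can be
    shown with a short sketch. The helper names are hypothetical and the
    cluster ratio of 16 is an assumed example value, not taken from the
    commit.

```python
def cluster_mask(lblk, cluster_ratio):
    """Round a block number down to the first block of its cluster."""
    return lblk & ~(cluster_ratio - 1)


def cluster_offset(lblk, cluster_ratio):
    """Position of a block inside its cluster."""
    return lblk & (cluster_ratio - 1)
```

    For block 37 with 16 blocks per cluster, the mask gives 32 (the start
    of the cluster) while the offset gives 5. Using one where the other
    is expected yields a wrong block number.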

    Signed-off-by: Eric Whitney

    Eric Whitney
     

03 Jan, 2014

3 commits

  • Merge patches from Andrew Morton:
    "Ten fixes"

    * emailed patches from Andrew Morton :
    epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL
    sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
    drivers/dma/ioat/dma.c: check DMA mapping error in ioat_dma_self_test()
    mm/memory-failure.c: transfer page count from head page to tail page after split thp
    MAINTAINERS: set up proper record for Xilinx Zynq
    mm: remove bogus warning in copy_huge_pmd()
    memcg: fix memcg_size() calculation
    mm: fix use-after-free in sys_remap_file_pages
    mm: munlock: fix deadlock in __munlock_pagevec()
    mm: munlock: fix a bug where THP tail page is encountered

    Linus Torvalds
     
  • The EPOLL_CTL_DEL path of epoll contains a classic AB-BA deadlock.
    That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x) will deadlock with
    epoll_ctl(b, EPOLL_CTL_DEL, a, x). The deadlock was introduced with
    commit 67347fe4e632 ("epoll: do not take global 'epmutex' for simple
    topologies").

    The acquisition of the ep->mtx for the destination 'ep' was added such
    that a concurrent EPOLL_CTL_ADD operation would see the correct state of
    the ep (specifically, the check for '!list_empty(&f.file->f_ep_links)').

    However, by simply not acquiring the lock, we do not serialize behind
    the ep->mtx from the add path, and thus may perform a full path check
    when if we had waited a little longer it may not have been necessary.
    However, this is a transient state, and performing the full loop
    checking in this case is not harmful.

    The important point is that we wouldn't miss doing the full loop
    checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
    it's operating upon. The reason we don't need to do lock ordering in the
    add path is that we are already holding the global 'epmutex'
    whenever we do the double lock. Further, the original posting of this
    patch, which was tested for the intended performance gains, did not
    perform this additional locking.
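
    The AB-BA pattern can be sketched without real threads by comparing
    the order in which two code paths take their locks. This is a
    simplified model: ctl_del_locks and has_abba are hypothetical names,
    and the lock order "ep first, then target" is an assumption about the
    pre-fix delete path.

```python
from itertools import combinations


def ctl_del_locks(ep, target, lock_target):
    """Mutexes taken, in order, by a modelled epoll_ctl(ep, EPOLL_CTL_DEL, target)."""
    order = [ep + "->mtx"]
    if lock_target:  # assumed pre-fix behaviour: also lock the target ep
        order.append(target + "->mtx")
    return order


def has_abba(path1, path2):
    """True if the two paths acquire some pair of locks in opposite order."""
    pairs1 = set(combinations(path1, 2))
    pairs2 = set(combinations(path2, 2))
    return any((b, a) in pairs2 for a, b in pairs1)
```

    With the nested lock, the two delete calls from the description take
    a->mtx and b->mtx in opposite orders. Without it, each call holds a
    single mutex and no inversion is possible.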

    Signed-off-by: Jason Baron
    Cc: Nathan Zimmer
    Cc: Eric Wong
    Cc: Nelson Elhage
    Cc: Al Viro
    Cc: Davide Libenzi
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • Pull GFS2 fixes from Steven Whitehouse:
    "Here is a set of small fixes for GFS2. There is a fix to drop
    s_umount which is copied in from the core vfs, two patches relate to a
    hard to hit "use after free" and memory leak. Two patches related to
    using DIO and buffered I/O on the same file to ensure correct
    operation in relation to glock state changes. The final patch adds an
    RCU read lock to ensure correct locking on an error path"

    * tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
    GFS2: Fix unsafe dereference in dump_holder()
    GFS2: Wait for async DIO in glock state changes
    GFS2: Fix incorrect invalidation for DIO/buffered I/O
    GFS2: Fix slab memory leak in gfs2_bufdata
    GFS2: Fix use-after-free race when calling gfs2_remove_from_ail
    GFS2: don't hold s_umount over blkdev_put

    Linus Torvalds
     

02 Jan, 2014

1 commit


28 Dec, 2013

3 commits

  • Set FILE_CREATED on O_CREAT|O_EXCL.

    cifs code didn't change during commit 116cc0225381415b96551f725455d067f63a76a0

    Kernel bugzilla 66251

    Signed-off-by: Shirish Pargaonkar
    Acked-by: Jeff Layton
    CC: Stable
    Signed-off-by: Steve French

    Shirish Pargaonkar
     
  • When we obtain tcon from cifs_sb, we use cifs_sb_tlink() to first obtain
    tlink which also grabs a reference to it. We do not drop this reference
    to tlink once we are done with the call.

    The patch fixes this issue by instead passing tcon as a parameter and
    avoids having to obtain a reference to the tlink. A lookup for the tcon
    is already made in the calling functions and this way we avoid having to
    re-run the lookup. This is also consistent with the argument list for
    other similar calls for M-F symlinks.

    We should also return an ENOSYS when we do not find a protocol specific
    function to lookup the MF Symlink data.
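
    The reference leak and the fix can be modelled with a plain counter.
    All names here (Tlink, get_tlink, check_symlink_leaky,
    check_symlink_fixed) are hypothetical stand-ins, not the cifs
    functions themselves.

```python
class Tlink:
    """Stand-in for a tlink: a refcounted handle that leads to a tcon."""

    def __init__(self, tcon):
        self.tcon = tcon
        self.refs = 0


def get_tlink(tlink):
    """Models a lookup that also grabs a reference."""
    tlink.refs += 1
    return tlink


def check_symlink_leaky(tlink, path):
    """Looks the tcon up again; the reference taken here is never dropped."""
    tcon = get_tlink(tlink).tcon
    return (tcon, path)


def check_symlink_fixed(tcon, path):
    """The caller already has the tcon, so no new reference is needed."""
    return (tcon, path)
```

    Each call to the leaky variant leaves the counter one higher, while
    the variant that receives the tcon as a parameter never touches it.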

    Signed-off-by: Sachin Prabhu
    Reviewed-by: Jeff Layton
    CC: Stable
    Signed-off-by: Steve French

    Sachin Prabhu
     
  • Signed-off-by: Steve French
    Signed-off-by: Gregor Beck
    Reviewed-by: Jeff Layton

    Steve French
     

27 Dec, 2013

1 commit

  • Pull ext4 fixes from Ted Ts'o:
    "A collection of bug fixes destined for stable and some printk cleanups
    and a patch so that instead of BUG'ing we use the ext4_error()
    framework to mark the file system as corrupted"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: add explicit casts when masking cluster sizes
    ext4: fix deadlock when writing in ENOSPC conditions
    jbd2: rename obsoleted msg JBD->JBD2
    jbd2: revise KERN_EMERG error messages
    jbd2: don't BUG but return ENOSPC if a handle runs out of space
    ext4: Do not reserve clusters when fs doesn't support extents
    ext4: fix del_timer() misuse for ->s_err_report
    ext4: check for overlapping extents in ext4_valid_extent_entries()
    ext4: fix use-after-free in ext4_mb_new_blocks
    ext4: call ext4_error_inode() if jbd2_journal_dirty_metadata() fails

    Linus Torvalds
     

24 Dec, 2013

1 commit


23 Dec, 2013

2 commits

  • Pull AIO leak fixes from Ben LaHaise:
    "I've put these two patches plus Linus's change through a round of
    tests, and it passes millions of iterations of the aio numa
    migratepage test, as well as a number of repetitions of a few simple
    read and write tests.

    The first patch fixes the memory leak Kent introduced, while the
    second patch makes aio_migratepage() much more paranoid and robust"

    * git://git.kvack.org/~bcrl/aio-next:
    aio/migratepages: make aio migrate pages sane
    aio: fix kioctx leak introduced by "aio: Fix a trinity splat"

    Linus Torvalds
     
  • Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages
    migration") the aio ring setup code has used a special per-ring backing
    inode for the page allocations, rather than just using random anonymous
    pages.

    However, rather than remembering the pages as it allocated them, it
    would allocate the pages, insert them into the file mapping (dirty, so
    that they couldn't be free'd), and then forget about them. And then to
    look them up again, it would mmap the mapping, and then use
    "get_user_pages()" to get back an array of the pages we just created.

    Now, not only is that incredibly inefficient, it also leaked all the
    pages if the mmap failed (which could happen due to excessive number of
    mappings, for example).

    So clean it all up, making it much more straightforward. Also remove
    some left-overs of the previous (broken) mm_populate() usage that was
    removed in commit d6c355c7dabc ("aio: fix race in ring buffer page
    lookup introduced by page migration support") but left the pointless and
    now misleading MAP_POPULATE flag around.

    Tested-and-acked-by: Benjamin LaHaise
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Dec, 2013

2 commits

  • The arbitrary restriction on page counts offered by the core
    migrate_page_move_mapping() code results in rather suspicious looking
    fiddling with page reference counts in the aio_migratepage() operation.
    To fix this, make migrate_page_move_mapping() take an extra_count parameter
    that allows aio to tell the code about its own reference count on the page
    being migrated.

    While cleaning up aio_migratepage(), make it validate that the old page
    being passed in is actually what aio_migratepage() expects to prevent
    misbehaviour in the case of races.
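
    The role of the extra_count parameter can be sketched as follows.
    This is a simplified model of a reference count check, assuming one
    base reference, one for the mapping and one for private data; the
    function names are hypothetical and the kernel's exact arithmetic may
    differ.

```python
def expected_count(has_mapping, has_private, extra_count=0):
    """References the migration code expects to see on the page."""
    count = 1 + extra_count  # base reference plus any declared by the caller
    if has_mapping:
        count += 1 + (1 if has_private else 0)
    return count


def can_migrate(page_count, has_mapping, has_private, extra_count=0):
    """Migration proceeds only if nobody else holds an unexpected reference."""
    return page_count == expected_count(has_mapping, has_private, extra_count)
```

    In this model a mapped page on which aio holds one reference of its
    own has a count of 3, so it only passes the check when the caller
    declares extra_count=1.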

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     
  • e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference
    counting to correct a bug trinity found. Unfortunately, the change led
    to kioctxes being leaked because there was no final reference count to
    put. Add that reference count back in to fix things.

    Signed-off-by: Benjamin LaHaise
    Cc: stable@vger.kernel.org

    Benjamin LaHaise
     

21 Dec, 2013

2 commits

  • Pull xfs bugfixes from Ben Myers:
    "This contains fixes for some asserts
    related to project quotas, a memory leak, a hang when disabling group or
    project quotas before disabling user quotas, Dave's email address, several
    fixes for the alignment of file allocation to stripe unit/width geometry, a
    fix for an assertion with xfs_zero_remaining_bytes, and the behavior of
    metadata writeback in the face of IO errors.

    Details:
    - fix memory leak in xfs_dir2_node_removename
    - fix quota assertion in xfs_setattr_size
    - fix quota assertions in xfs_qm_vop_create_dqattach
    - fix for hang when disabling group and project quotas before
    disabling user quotas
    - fix Dave Chinner's email address in MAINTAINERS
    - fix for file allocation alignment
    - fix for assertion in xfs_buf_stale by removing xfsbdstrat
    - fix for alignment with swalloc mount option
    - fix for "retry forever" semantics on IO errors"

    * tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs:
    xfs: abort metadata writeback on permanent errors
    xfs: swalloc doesn't align allocations properly
    xfs: remove xfsbdstrat error
    xfs: align initial file allocations correctly
    MAINTAINERS: fix incorrect mail address of XFS maintainer
    xfs: fix infinite loop by detaching the group/project hints from user dquot
    xfs: fix assertion failure at xfs_setattr_nonsize
    xfs: fix false assertion at xfs_qm_vop_create_dqattach
    xfs: fix memory leak in xfs_dir2_node_removename

    Linus Torvalds
     
  • Some pstore backing devices use on board flash as persistent
    storage. These have limited numbers of write cycles so it
    is a poor idea to use them for high frequency operations.

    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Luck, Tony
     

20 Dec, 2013

3 commits

  • The missing casts can cause the high bits of 64-bit physical block
    numbers to be lost. Set up new macros which allow us to make sure the
    right thing happens, even if at some point we end up supporting larger
    logical block numbers.

    Thanks to Emese Revfy and the PaX security team for reporting this
    issue.
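
    The effect of a missing cast can be reproduced with explicit bit
    widths. This sketch assumes the cluster ratio is held in a 32-bit
    unsigned value, so that an uncast mask is zero-extended before being
    applied to a 64-bit block number; the function names are hypothetical.

```python
U32 = 0xFFFFFFFF
U64 = 0xFFFFFFFFFFFFFFFF


def cmask_without_cast(pblk, cluster_ratio):
    """Mask built in 32 bits, then zero-extended: bits 32..63 are cleared."""
    mask32 = ~(cluster_ratio - 1) & U32
    return pblk & mask32


def cmask_with_cast(pblk, cluster_ratio):
    """Mask built in 64 bits: only the low offset bits are cleared."""
    mask64 = ~(cluster_ratio - 1) & U64
    return pblk & mask64
```

    For physical block 0x100000025 and a ratio of 16, the uncast version
    returns 0x20 and silently drops the upper half of the block number,
    whereas the cast version returns 0x100000020.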

    Reported-by: PaX Team
    Reported-by: Emese Revfy
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • We need to wait for any outstanding DIO to complete in a couple
    of situations. Firstly, in case we are changing out of deferred
    mode (in inode_go_sync) where GLF_DIRTY will not be set. That
    call could be prefixed with a test for gl_state == LM_ST_DEFERRED
    but it doesn't seem worth it bearing in mind that the test for
    outstanding DIO is very quick anyway, in the usual case that there
    is none.

    The second case is in inode_go_lock which will catch the cases
    where we have a cached EX lock, but where we grant deferred locks
    against it so that there is no glock state transition. We only
    need to wait if the state is not deferred, since DIO is valid
    anyway in that state.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • In patch 209806aba9d540dde3db0a5ce72307f85f33468f we allowed
    local deferred locks to be granted against a cached exclusive
    lock. That opened up a corner case which this patch now
    fixes.

    The solution to the problem is to check whether we have cached
    pages each time we do direct I/O and if so to unmap, flush
    and invalidate those pages. Since the glock state machine
    normally does that for us, mostly the code will be a no-op.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

19 Dec, 2013

1 commit


18 Dec, 2013

2 commits

  • Akira-san has been reporting rare deadlocks of his machine when running
    xfstests test 269 on ext4 filesystem. The problem turned out to be in
    ext4_da_reserve_metadata() and ext4_da_reserve_space() which called
    ext4_should_retry_alloc() while holding i_data_sem. Since
    ext4_should_retry_alloc() can force a transaction commit, this is a
    lock ordering violation and leads to deadlocks.

    Fix the problem by just removing the retry loops. These functions should
    just report ENOSPC to the caller (e.g. ext4_da_write_begin()) and that
    function must take care of retrying after dropping all necessary locks.
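
    The resulting division of labour can be sketched like this. The
    state dictionary and the three function names are hypothetical, and
    the "commit frees pending clusters" behaviour is a deliberately
    simplified assumption.

```python
import errno


def reserve_space(state):
    """Runs with the modelled i_data_sem held and never retries on its own."""
    state["sem_held"] = True
    try:
        if state["free"] == 0:
            return -errno.ENOSPC
        state["free"] -= 1
        return 0
    finally:
        state["sem_held"] = False


def should_retry_alloc(state):
    """May force a commit, so it must only run with the semaphore dropped."""
    assert not state["sem_held"]
    if state["pending_frees"] == 0:
        return False
    state["free"] += state["pending_frees"]  # the commit releases clusters
    state["pending_frees"] = 0
    return True


def write_begin(state):
    """Caller-side retry loop, executed after all locks have been released."""
    while True:
        ret = reserve_space(state)
        if ret != -errno.ENOSPC or not should_retry_alloc(state):
            return ret
```

    The retry decision is made only in write_begin, where the semaphore
    is no longer held, which is the lock ordering the commit asks for.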

    Reported-and-tested-by: Akira Fujita
    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Jan Kara
     
  • Pull two Ceph fixes from Sage Weil:
    "One of these is fixing a regression from the d_flags file type patch
    that went into -rc1 that broke instantiation of inodes and dentries
    (we were doing dentries first). The other is just an off-by-one
    corner case"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: Avoid data inconsistency due to d-cache aliasing in readpage()
    ceph: initialize inode before instantiating dentry

    Linus Torvalds
     

17 Dec, 2013

8 commits

  • If we are doing async writeback of metadata, we can get write errors
    but have nobody to report them to. At the moment, we simply attempt
    to reissue the write from io completion in the hope that it's a
    transient error.

    When it's not a transient error, the buffer is stuck forever in
    this loop, and we cannot break out of it. Eventually, unmount will
    hang because the AIL cannot be emptied and everything goes downhill
    from there.

    To solve this problem, only retry the write IO once before aborting
    it. We don't throw the buffer away because some transient errors can
    last minutes (e.g. FC path failover) or even hours (thin
    provisioned devices that have run out of backing space) before they
    go away. Hence we really want to keep trying until we can't try any
    more.

    Because the buffer was not cleaned, however, it does not get removed
    from the AIL and hence the next pass across the AIL will start IO on
    it again. As such, we still get the "retry forever" semantics that
    we currently have, but we allow other access to the buffer in the
    mean time. Meanwhile the filesystem can continue to modify the
    buffer and relog it, so the IO errors won't hang the log or the
    filesystem.

    Now when we are pushing the AIL, we can see all these "permanent IO
    error" buffers and we can issue a warning about failures before we
    retry the IO. We can also catch these buffers when unmounting and
    issue a corruption warning, too.
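
    The "retry once per push, but keep the buffer in the AIL" behaviour
    can be modelled as below. This is a sketch under assumptions: the
    buffer is a plain dictionary, push_buffer is a hypothetical name, and
    write_ok stands in for the device accepting or failing a write.

```python
def push_buffer(buf, write_ok):
    """One AIL push of a buffer; returns the number of write attempts made."""
    attempts = 0
    retried = False
    while True:
        attempts += 1
        if write_ok():
            buf["in_ail"] = False  # clean buffers leave the AIL
            return attempts
        if retried:
            return attempts        # abort for now; the buffer stays in the AIL
        retried = True             # a single immediate retry
```

    A permanently failing device costs two attempts per push instead of
    an endless loop, and a later push can still succeed once the error
    clears.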

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • When swalloc is specified as a mount option, allocations are
    supposed to be aligned to the stripe width rather than the stripe
    unit of the underlying filesystem. However, it does not do this.

    What the implementation does is round up the allocation size to a
    stripe width, hence ensuring that all allocations span a full stripe
    width. It does not, however, ensure that that allocation is aligned
    to a stripe width, and hence the allocations can span multiple
    underlying stripes and so still see RMW cycles for things like
    direct IO on MD RAID.

    So, if the swalloc mount option is set, change the allocation
    alignment in xfs_bmap_btalloc() to use the stripe width rather than
    the stripe unit.
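
    The difference between rounding the size and aligning the start can
    be checked with a little arithmetic. The numbers (a stripe unit of 16
    blocks, a stripe width of 64 blocks, a 50-block request) are assumed
    example values and the helper names are hypothetical.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def stripes_touched(start, length, stripe_width):
    """Number of full-width stripes that the extent [start, start+length) overlaps."""
    first = start // stripe_width
    last = (start + length - 1) // stripe_width
    return last - first + 1
```

    A 50-block request is rounded up to 64 blocks in both cases. Placed
    at block 16 (aligned only to the stripe unit) it straddles two
    stripes, while placed at block 64 (aligned to the stripe width) it
    covers exactly one.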

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that
    handles the case of a shut down filesystem. Most of the users have private,
    uncached buffers that can just be freed in this case, but the complex error
    handling in xfs_bioerror_relse messes up the case when it's called without
    a locked buffer.

    Remove xfsbdstrat and opencode the error handling in the callers. All but
    one can simply return an error and don't need to deal with buffer state,
    and the one caller that cares about the buffer state could do with a major
    cleanup as well, but we'll defer that to later.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • The function xfs_bmap_isaeof() is used to indicate that an
    allocation is occurring at or past the end of file, and as such
    should be aligned to the underlying storage geometry if possible.

    Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
    behaviour of this function for empty files - it turned off
    allocation alignment for this case accidentally. Hence large initial
    allocations from direct IO are not getting correctly aligned to the
    underlying geometry, and that is causing write performance to drop in
    alignment-sensitive configurations.

    Fix it by considering allocation into empty files as requiring
    aligned allocation again.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    (cherry picked from commit f9b395a8ef8f34d19cae2cde361e19c96e097fad)

    Dave Chinner
     
  • xfs_quota(8) will hang if asked to turn group/project quota off
    before the user quota is off. This can be reproduced 100% of the time by:
    # mount -ouquota,gquota /dev/sda7 /xfs
    # mkdir /xfs/test
    # xfs_quota -xc 'off -g' /xfs /proc/sysrq-trigger
    # dmesg

    SysRq : Show Blocked State
    task PC stack pid father
    xfs_quota D 0000000000000000 0 27574 2551 0x00000000
    [snip]
    Call Trace:
    [] schedule+0xad/0xc0
    [] schedule_timeout+0x35e/0x3c0
    [] ? mark_held_locks+0x176/0x1c0
    [] ? call_timer_fn+0x2c0/0x2c0
    [] ? xfs_qm_shrink_count+0x30/0x30 [xfs]
    [] schedule_timeout_uninterruptible+0x26/0x30
    [] xfs_qm_dquot_walk+0x235/0x260 [xfs]
    [] ? xfs_perag_get+0x1d8/0x2d0 [xfs]
    [] ? xfs_perag_get+0x5/0x2d0 [xfs]
    [] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs]
    [] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs]
    [] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs]
    [] xfs_qm_dqpurge_all+0x66/0xb0 [xfs]
    [] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs]
    [] xfs_fs_set_xstate+0x136/0x180 [xfs]
    [] do_quotactl+0x53a/0x6b0
    [] ? iput+0x5b/0x90
    [] SyS_quotactl+0x167/0x1d0
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] system_call_fastpath+0x16/0x1b

    It's fine if we turn user quota off first and then turn off the other
    kinds of quota if they are enabled, since the group/project dquot
    refcount drops to zero once the user quota is off. Otherwise, the
    refcount of those dquots is non-zero because the user dquots might
    refer to them as hint(s). Hence, the above operation causes an
    infinite loop in xfs_qm_dquot_walk() while trying to purge the dquot
    cache.

    This problem has been around since Linux 3.4; it was introduced by:
    [ b84a3a9675 xfs: remove the per-filesystem list of dquots ]

    Originally, xfs_qm_detach_gdquots() released the group dquot pointers
    that the user dquots may be carrying around as a hint. However, with
    the above change, no such work is done before purging the
    group/project dquot cache.

    To solve this problem, this patch introduces a special routine,
    xfs_qm_dqpurge_hints(), which releases the group/project dquot
    pointers that the user dquots may be carrying around as a hint, and
    then proceeds to purge the user dquot cache if requested.
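
    Why the hints block the purge can be seen in a small refcount model.
    Everything here is a hypothetical simplification: Dquot, set_hint,
    purge_hints and purge are illustrative names, and the walk is bounded
    to a few passes so that the example terminates.

```python
class Dquot:
    def __init__(self, name):
        self.name = name
        self.nrefs = 0
        self.hint = None  # a user dquot may cache a group/project dquot here


def set_hint(user_dq, group_dq):
    """Cache a group dquot on a user dquot, taking a reference on it."""
    user_dq.hint = group_dq
    group_dq.nrefs += 1


def purge_hints(user_dquots):
    """Drop the cached hints so the group dquots become unreferenced."""
    for dq in user_dquots:
        if dq.hint is not None:
            dq.hint.nrefs -= 1
            dq.hint = None


def purge(dquots, max_passes=3):
    """Bounded stand-in for the purge walk; returns the dquots still busy."""
    remaining = list(dquots)
    for _ in range(max_passes):
        remaining = [dq for dq in remaining if dq.nrefs > 0]
        if not remaining:
            break
    return remaining
```

    As long as a user dquot holds a hint, the group dquot never becomes
    purgeable no matter how many passes are made. Releasing the hints
    first lets the purge finish.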

    Cc: stable@vger.kernel.org
    Signed-off-by: Jie Liu
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    (cherry picked from commit df8052e7dae00bde6f21b40b6e3e1099770f3afc)

    Jie Liu
     
  • With a CRC-enabled v5 super block, simply changing a file's ownership
    can trigger an ASSERT failure in xfs_setattr_nonsize() if both group
    and project quota are enabled, e.g.:

    [ 305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621
    [ 305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable]
    [ 305.383939] Call Trace:
    [ 305.385536] [] xfs_setattr_nonsize+0x69a/0x720 [xfs]
    [ 305.387142] [] xfs_vn_setattr+0x29/0x70 [xfs]
    [ 305.388727] [] notify_change+0x1a8/0x350
    [ 305.390298] [] chown_common+0xfd/0x110
    [ 305.391868] [] SyS_fchownat+0xaf/0x110
    [ 305.393440] [] SyS_lchown+0x20/0x30
    [ 305.394995] [] system_call_fastpath+0x1a/0x1f
    [ 305.399870] RIP [] assfail+0x22/0x30 [xfs]

    This fix adjusts the assertion to check whether the super block
    supports both quota inodes.

    Signed-off-by: Jie Liu
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    (cherry picked from commit 5a01dd54f4a7fb513062070c5acef20d13cad980)

    Jie Liu
     
  • After the previous fix, there is still another ASSERT failure if any
    type of quota is turned off while fsstress is running.

    Backtrace in this case:

    [ 50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118
    [ 50.867924] ------------[ cut here ]------------
    ...
    [ 50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable]
    [ 50.867999] invalid opcode: 0000 [#1] SMP
    [ 50.869407] Call Trace:
    [ 50.869446] [] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs]
    [ 50.869512] [] xfs_create+0x5c5/0x6a0 [xfs]
    [ 50.869564] [] xfs_vn_mknod+0xac/0x1d0 [xfs]
    [ 50.869615] [] xfs_vn_mkdir+0x16/0x20 [xfs]
    [ 50.869655] [] vfs_mkdir+0x95/0x130
    [ 50.869689] [] SyS_mkdirat+0xaa/0xe0
    [ 50.869723] [] SyS_mkdir+0x19/0x20
    [ 50.869757] [] system_call_fastpath+0x1a/0x1f
    [ 50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89
    [ 50.870003] RIP [] assfail+0x22/0x30 [xfs]
    [ 50.870050] RSP
    [ 50.879251] ---[ end trace c93a2b342341c65b ]---

    We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach(),
    however the assertion itself is not right IMHO. While performing quota off, we
    firstly clear the XFS_*QUOTA_ACTIVE bit(s) from struct xfs_mount without taking
    any special locks, see xfs_qm_scall_quotaoff(). Hence there is no guarantee
    that the desired quota is still active.

    Signed-off-by: Jie Liu
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    (cherry picked from commit 37eb9706ebf5b99d14c6086cdeef2c2f73f9c9fb)

    Jie Liu
     
  • Fix the leak of kernel memory in xfs_dir2_node_removename()
    when xfs_dir2_leafn_remove() returns an error code.

    Signed-off-by: Mark Tinguely
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    (cherry picked from commit ef701600fd26cace9d513ee174688a2b83832126)

    Mark Tinguely
     

14 Dec, 2013

2 commits

  • This patch fixes a slab memory leak that sometimes can occur
    for files with a very short lifespan. The problem occurs when
    a dinode is deleted before it has gotten to the journal properly.
    In the leak scenario, the bd object is pinned for journal
    commitment (queued to the metadata buffers queue: sd_log_le_buf)
    but is subsequently unpinned and dequeued before it finds its way
    to the ail or the revoke queue. In this rare circumstance, the bd
    object needs to be freed from slab memory, or it is forgotten.
    We have to be very careful how we do it, though, because
    multiple processes can call gfs2_remove_from_journal. In order to
    avoid double-frees, only the process that does the unpinning is
    allowed to free the bd.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • Function gfs2_remove_from_ail drops the reference on the bh via
    brelse. This patch fixes a race condition whereby bh is dereferenced
    after the brelse when setting bd->bd_blkno = bh->b_blocknr;
    Under certain rare circumstances, bh might be gone or reused,
    and bd->bd_blkno is set to whatever that memory happens to be,
    which is often 0. Later, in gfs2_trans_add_unrevoke, that bd fails
    the test "bd->bd_blkno >= blkno" which causes it to never be freed.
    The end result is that the bd is never freed from the bufdata cache,
    which results in this error:
    slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
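
    The ordering issue can be imitated in a memory-safe language by
    overwriting the field when the last reference is dropped. This is
    only an analogy: BufferHead, the modelled brelse and the two helper
    functions are hypothetical, and setting the field to 0 stands in for
    the memory being freed or reused.

```python
class BufferHead:
    def __init__(self, blocknr):
        self.b_blocknr = blocknr
        self.refs = 1


def brelse(bh):
    """Drop a reference; on the last one the contents become garbage."""
    bh.refs -= 1
    if bh.refs == 0:
        bh.b_blocknr = 0  # stand-in for freed or reused memory


def blkno_read_after_release(bh):
    """Order before the fix: the field is read after the reference is gone."""
    brelse(bh)
    return bh.b_blocknr


def blkno_read_before_release(bh):
    """Order after the fix: copy the field while the reference is still held."""
    blkno = bh.b_blocknr
    brelse(bh)
    return blkno
```

    Reading after the release yields 0 in this model, matching the "often
    0" observation above, while reading first preserves the real block
    number.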

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson