05 Oct, 2013

4 commits

  • This fixes a build failure caused by calling the free() function which
    does not exist in the Linux kernel.

    Signed-off-by: Thierry Reding
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    (cherry picked from commit aaaae98022efa4f3c31042f1fdf9e7a0c5f04663)

    Thierry Reding
     
  • Free the memory in error path of xlog_recover_add_to_trans().
    Normally this memory is freed in recovery pass2, but is leaked
    in the error path.

    Signed-off-by: Mark Tinguely
    Reviewed-by: Eric Sandeen
    Signed-off-by: Ben Myers

    (cherry picked from commit 519ccb81ac1c8e3e4eed294acf93be00b43dcad6)

    tinguely@sgi.com
     
  • The determination of whether a directory entry contains a dtype
    field originally was dependent on the filesystem having CRCs
    enabled. This meant that the format for dtype being enabled could be
    determined by checking the directory block magic number rather than
    doing a feature bit check. This was useful in that it meant that we
    didn't need to pass a struct xfs_mount around to functions that
    were already supplied with a directory block header.

    Unfortunately, the introduction of dtype fields into the v4
    structure via a feature bit meant this "use the directory block
    magic number" method of discriminating the dirent entry sizes is
    broken. Hence we need to convert the places that use magic number
    checks to use feature bit checks so that they work correctly and not
    by chance.

    The current code works on v4 filesystems only because the dirent
    size roundup covers the extra byte needed by the dtype field in the
    places where this problem occurs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    (cherry picked from commit 367993e7c6428cb7617ab7653d61dca54e2fdede)

    Dave Chinner
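The interaction between the optional dtype byte and the size roundup can be sketched as below. This is a toy model, not the kernel code: `toy_mount`, `toy_dirent_size` and the field widths (8-byte inode number, 1-byte name length, 2-byte tag, 8-byte alignment) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_DIRENT_ALIGN 8u /* assumed alignment of data entries */

/* Hypothetical mount structure carrying the feature bit. */
struct toy_mount {
    bool has_ftype; /* set from a feature bit, independent of CRCs */
};

/*
 * Size of one data entry under the assumed layout: inode number (8),
 * name length (1), the name itself, an optional file type byte (1)
 * and a tag (2), rounded up to the alignment.
 */
static size_t toy_dirent_size(const struct toy_mount *mp, size_t namelen)
{
    size_t raw = 8 + 1 + namelen + (mp->has_ftype ? 1 : 0) + 2;

    return (raw + TOY_DIRENT_ALIGN - 1) / TOY_DIRENT_ALIGN * TOY_DIRENT_ALIGN;
}
```

Under these assumptions a 4-byte name gives the same size with or without the dtype byte, because the roundup absorbs it, while a 5-byte name does not. A size computed from the wrong check is therefore only right by chance.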
     
  • Michael Semon reported that xfs/299 generated this lockdep warning:

    =============================================
    [ INFO: possible recursive locking detected ]
    3.12.0-rc2+ #2 Not tainted
    ---------------------------------------------
    touch/21072 is trying to acquire lock:
    (&xfs_dquot_other_class){+.+...}, at: [] xfs_trans_dqlockedjoin+0x57/0x64

    but task is already holding lock:
    (&xfs_dquot_other_class){+.+...}, at: [] xfs_trans_dqlockedjoin+0x57/0x64

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&xfs_dquot_other_class);
    lock(&xfs_dquot_other_class);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    7 locks held by touch/21072:
    #0: (sb_writers#10){++++.+}, at: [] mnt_want_write+0x1e/0x3e
    #1: (&type->i_mutex_dir_key#4){+.+.+.}, at: [] do_last+0x245/0xe40
    #2: (sb_internal#2){++++.+}, at: [] xfs_trans_alloc+0x1f/0x35
    #3: (&(&ip->i_lock)->mr_lock/1){+.+...}, at: [] xfs_ilock+0x100/0x1f1
    #4: (&(&ip->i_lock)->mr_lock){++++-.}, at: [] xfs_ilock_nowait+0x105/0x22f
    #5: (&dqp->q_qlock){+.+...}, at: [] xfs_trans_dqlockedjoin+0x57/0x64
    #6: (&xfs_dquot_other_class){+.+...}, at: [] xfs_trans_dqlockedjoin+0x57/0x64

    The lockdep annotation for dquot lock nesting only understands
    locking for user and "other" dquots, not user, group and quota
    dquots. Fix the annotations to match the locking hierarchy we now
    have.

    Reported-by: Michael L. Semon
    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    (cherry picked from commit f112a049712a5c07de25d511c3c6587a2b1a015e)

    Dave Chinner
     

26 Sep, 2013

1 commit

  • Commit f5ea1100 cleans up the disk to host conversions for
    node directory entries, but because a variable is reused in
    xfs_node_toosmall() the next node is not correctly found.
    If the original node is small enough (< BBTOB(bp->b_length),
    file: /root/newest/xfs/fs/xfs/xfs_trans_buf.c, line: 569

    Keep the original node header to get the correct forward node.

    (When a node is considered for a merge with a sibling, it overwrites the
    sibling pointers of the original incore nodehdr with the sibling's
    pointers. This leads to the loop considering the original node as a merge
    candidate with itself in the second pass, and so it incorrectly
    determines a merge should occur.)

    Signed-off-by: Mark Tinguely
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    [v3: added Dave Chinner's (slightly modified) suggestion to the commit header,
    cleaned up whitespace. -bpm]

    Mark Tinguely
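The idea of keeping the original node header across both passes can be sketched as below. This is a simplified model, not xfs_node_toosmall() itself: the `toy_nodehdr` layout, the array-based block lookup and the `capacity` merge test are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical in-core node header; block number 0 means "no sibling". */
struct toy_nodehdr {
    uint32_t forw;  /* next sibling */
    uint32_t back;  /* previous sibling */
    uint16_t count; /* entries in this node */
};

/*
 * Return the block number of a sibling that could absorb this node, or 0.
 * The original header is copied once and never overwritten, so the second
 * pass still follows the node's own back pointer rather than the sibling's.
 */
static uint32_t toy_find_merge_sibling(const struct toy_nodehdr *nodes,
                                       uint32_t self, uint16_t capacity)
{
    const struct toy_nodehdr orig = nodes[self];

    for (int pass = 0; pass < 2; pass++) {
        uint32_t blkno = (pass == 0) ? orig.forw : orig.back;

        if (blkno == 0)
            continue;

        /* Decode the sibling into its own variable, not into orig. */
        struct toy_nodehdr sib = nodes[blkno];

        if ((unsigned int)orig.count + sib.count <= capacity)
            return blkno;
    }
    return 0;
}
```

If the sibling header were read into the same variable as the original, the second pass would follow the sibling's pointer back to the node itself, which is the self-merge described above.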
     

25 Sep, 2013

4 commits

  • After a fair number of xfstests runs, xfs/182 started to fail
    regularly with a corrupted directory - a directory read verifier was
    failing after recovery because it found a block with a XARM magic
    number (remote attribute block) rather than a directory data block.

    The first time I saw this repeated failure I did /something/ and the
    problem went away, so I was never able to find the underlying
    problem. Test xfs/182 failed again today, and I found the root
    cause before I did /something else/ that made it go away.

    Tracing indicated that the block in question was being correctly
    logged, the log was being flushed by sync, but the buffer was not
    being written back before the shutdown occurred. Tracing also
    indicated that log recovery was also reading the block, but then
    never writing it before log recovery invalidated the cache,
    indicating that it was not modified by log recovery.

    More detailed analysis of the corpse indicated that the filesystem
    had a uuid of "a4131074-1872-4cac-9323-2229adbcb886" but the XARM
    block had a uuid of "8f32f043-c3c9-e7f8-f947-4e7f989c05d3", which
    indicated it was a block from an older filesystem. The reason that
    log recovery didn't replay it was that the LSN in the XARM block was
    larger than the LSN of the transaction being replayed, and so the
    block was not overwritten by log recovery.

    Hence, log recovery can't blindly trust the magic number and LSN in
    the block - it must verify that it belongs to the filesystem being
    recovered before using the LSN. i.e. if the UUIDs don't match, we
    need to unconditionally recover the change held in the log.

    This patch was first tested on a block device that was repeatedly
    causing xfs/182 to fail with the same failure on the same block with
    the same directory read corruption signature (i.e. XARM block). It
    did not fail, and hasn't failed since.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
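The replay decision described above can be sketched roughly as below. The structure and function names are hypothetical, and treating an equal LSN as "already up to date" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the metadata header fields recovery looks at. */
struct toy_block_hdr {
    uint8_t  uuid[16];
    uint64_t lsn;
};

/* Decide whether the logged change must be written over the block. */
static bool toy_should_replay(const struct toy_block_hdr *blk,
                              const uint8_t fs_uuid[16],
                              uint64_t replay_lsn)
{
    /* A foreign UUID means the LSN belongs to an older filesystem. */
    if (memcmp(blk->uuid, fs_uuid, 16) != 0)
        return true;

    /* Same filesystem: replay only if the block is older than the log. */
    return blk->lsn < replay_lsn;
}
```

With this ordering of checks, a stale block with a large LSN left over from a previous filesystem is still overwritten, because its LSN is never consulted.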
     
  • It uses a kernel internal structure in its definition rather than
    the user visible structure that is passed to the ioctl.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • When we free an inode, we do so via RCU. As an RCU lookup can occur
    at any time before we free an inode, and that lookup takes the inode
    flags lock, we cannot safely assert that the flags lock is not held
    just before marking it dead and running call_rcu() to free the
    inode.

    We check on allocation of a new inode structure that the lock is not
    held, so we still have protection against locks being leaked and
    hence not correctly initialised when allocated out of the slab.
    Hence just remove the assert...

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Regression introduced by commit 46f9d2e ("xfs: aborted buf items can
    be in the AIL") which fails to lock the AIL before removing the
    item. Spinlock debugging throws a warning about this.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

13 Sep, 2013

3 commits

  • Pull xfs update #2 from Ben Myers:
    "Here we have defrag support for v5 superblock, a number of bugfixes
    and a cleanup or two.

    - defrag support for CRC filesystems
    - fix endian warning in xlog_recover_get_buf_lsn
    - fixes for sparse warnings
    - fix for assert in xfs_dir3_leaf_hdr_from_disk
    - fix for log recovery of remote symlinks
    - fix for log recovery of btree root splits
    - fixes for memory allocation failures with ACLs
    - fix for assert in xfs_buf_item_relse
    - fix for assert in xfs_inode_buf_verify
    - fix an assignment in an assert that should be a test in
    xfs_bmbt_change_owner
    - remove dead code in xlog_recover_inode_pass2"

    * tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs:
    xfs: remove dead code from xlog_recover_inode_pass2
    xfs: = vs == typo in ASSERT()
    xfs: don't assert fail on bad inode numbers
    xfs: aborted buf items can be in the AIL.
    xfs: factor all the kmalloc-or-vmalloc fallback allocations
    xfs: fix memory allocation failures with ACLs
    xfs: ensure we copy buffer type in da btree root splits
    xfs: set remote symlink buffer type for recovery
    xfs: recovery of swap extents operations for CRC filesystems
    xfs: swap extents operations for CRC filesystems
    xfs: check magic numbers in dir3 leaf verifier first
    xfs: fix some minor sparse warnings
    xfs: fix endian warning in xlog_recover_get_buf_lsn()

    Linus Torvalds
     
  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton : (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
  • truncate_pagecache() doesn't care about old size since commit
    cedabed49b39 ("vfs: Fix vmtruncate() regression"). Let's drop it.

    Signed-off-by: Kirill A. Shutemov
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Sep, 2013

2 commits


11 Sep, 2013

18 commits

  • This patch adds the missing call to list_lru_destroy (spotted by Li Zhong)
    and moves the deletion to after the shrinker is unregistered, as correctly
    spotted by Dave.

    Signed-off-by: Glauber Costa
    Cc: Michal Hocko
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • We currently use a compile-time constant to size the node array for the
    list_lru structure. Due to this, we don't need to allocate any memory at
    initialization time. But as a consequence, the structures that contain
    embedded list_lru lists can become way too big (the superblock for
    instance contains two of them).

    This patch aims at ameliorating this situation by dynamically allocating
    the node arrays with the firmware provided nr_node_ids.

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
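The trade-off described above, a small embedded structure in exchange for an allocation at init time and a matching destroy call, might look like the following userspace sketch. The `toy_*` names are hypothetical, and `calloc`/`free` stand in for the kernel allocators.

```c
#include <stdlib.h>

/* Hypothetical per-node state; the real structure holds a lock and a list. */
struct toy_lru_node {
    long nr_items;
};

/* Only a pointer and a count are embedded in the containing object. */
struct toy_list_lru {
    struct toy_lru_node *node;
    unsigned int nr_nodes;
};

/* Allocate one zeroed entry per possible node; returns 0 on success. */
static int toy_list_lru_init(struct toy_list_lru *lru, unsigned int nr_node_ids)
{
    if (nr_node_ids == 0)
        return -1;

    lru->node = calloc(nr_node_ids, sizeof(*lru->node));
    if (!lru->node)
        return -1;

    lru->nr_nodes = nr_node_ids;
    return 0;
}

/* Initialisation can now fail and must be paired with a destroy call. */
static void toy_list_lru_destroy(struct toy_list_lru *lru)
{
    free(lru->node);
    lru->node = NULL;
    lru->nr_nodes = 0;
}
```

The destroy half is the part that is easy to forget once the array is no longer embedded, which is the kind of omission the list_lru_destroy fix at the top of this date group addresses.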
     
  • The new LRU list isolation code in xfs_qm_dquot_isolate() isn't
    completely up to date. Firstly, it needs conversion to return enum
    lru_status values, not raw numbers. Secondly - most importantly - it
    fails to unlock the dquot and relock the LRU in the LRU_RETRY path.
    This leads to deadlocks in xfstests generic/232. Fix them.

    Signed-off-by: Dave Chinner
    Cc: Glauber Costa
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • fix warnings

    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Andrew Morton
     
  • Convert the XFS dquot lru to use the list_lru construct and convert the
    shrinker to being node aware.

    [glommer@openvz.org: edited for conflicts + warning fixes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • In converting the buffer lru lists to use the generic code, the locking
    for marking the buffers as on the dispose list was lost. This results in
    confusion in LRU buffer tracking and accounting, resulting in reference
    counts being mucked up and the filesystem being unmountable.

    To fix this, introduce an internal buffer spinlock to protect the state
    field that holds the dispose list information. Because there is now
    locking needed around xfs_buf_lru_add/del, and they are used in exactly
    one place each two lines apart, get rid of the wrappers and code the logic
    directly in place.

    Further, the LRU emptying code used on unmount is less than optimal.
    Convert it to use a dispose list as per a normal shrinker walk, and repeat
    the walk that fills the dispose list until the LRU is empty. This avoids
    needing to drop and regain the LRU lock for every item being freed, and
    allows the same logic as the shrinker isolate call to be used. Simpler,
    easier to understand.

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • fix warnings

    Cc: Dave Chinner
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Andrew Morton
     
  • Convert the buftarg LRU to use the new generic LRU list and take advantage
    of the functionality it supplies to make the buffer cache shrinker node
    aware.

    Signed-off-by: Glauber Costa
    Signed-off-by: Dave Chinner
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Now that the shrinker is passing a node in the scan control structure, we
    can pass this to the generic LRU list code to isolate reclaim to the
    lists on matching nodes.

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert superblock shrinker to use the new count/scan API, and propagate
    the API changes through to the filesystem callouts. The filesystem
    callouts already use a count/scan API, so it's just changing counters to
    longs to match the VM API.

    This requires the dentry and inode shrinker callouts to be converted to
    the count/scan API. This is mainly a mechanical change.

    [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • The sysctl knob sysctl_vfs_cache_pressure is used to determine which
    percentage of the shrinkable objects in our cache we should actively try
    to shrink.

    It works great in situations in which we have many objects (at least more
    than 100), because the approximation errors will be negligible. But if
    this is not the case, especially when total_objects < 100, we may end up
    concluding that we have no objects at all (total / 100 = 0, if total <
    100).

    This is certainly not the biggest killer in the world, but may matter in
    very low kernel memory situations.

    Signed-off-by: Glauber Costa
    Reviewed-by: Carlos Maiolino
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
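The rounding problem is easy to reproduce in isolation. The sketch below contrasts the divide-first calculation with a quotient-plus-remainder form, which is the same arithmetic as the kernel's mult_frac() helper mentioned elsewhere in this log. The `toy_*` function names are hypothetical, and whether the patch uses exactly this form is an assumption.

```c
/* Divide first: any total below 100 collapses to zero. */
static unsigned long toy_scale_naive(unsigned long total,
                                     unsigned long pressure)
{
    return total / 100 * pressure;
}

/*
 * Scale x by numer/denom using the quotient and the remainder separately,
 * so small values keep their fractional contribution and the intermediate
 * product stays small.
 */
static unsigned long toy_mult_frac(unsigned long x, unsigned long numer,
                                   unsigned long denom)
{
    unsigned long quot = x / denom;
    unsigned long rem = x % denom;

    return quot * numer + rem * numer / denom;
}
```

With 50 objects and a pressure of 100, the naive form reports 0 objects while the second form reports 50.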
     
  • Let the inode verifier do its work by returning an error when we
    fail to find correct magic numbers in an inode buffer.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Saw this on generic/270 after a DQALLOC transaction overrun
    shutdown:

    XFS: Assertion failed: !(bip->bli_item.li_flags & XFS_LI_IN_AIL), file: fs/xfs/xfs_buf_item.c, line: 952
    .....
    xfs_buf_item_relse+0x4f/0xd0
    xfs_buf_item_unlock+0x1b4/0x1e0
    xfs_trans_free_items+0x7d/0xb0
    xfs_trans_cancel+0x13c/0x1b0
    xfs_symlink+0x37e/0xa60
    ....

    When a transaction abort occurred.

    If we are aborting a transaction and trigger this code path, then
    the item may be dirty. If the item is dirty, then it may be in the
    AIL. Hence if we are aborting, we need to check if the item is in
    the AIL and remove it before freeing it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • We have quite a few places now where we do:

    x = kmem_zalloc(large size)
    if (!x)
        x = kmem_zalloc_large(large size)

    and do a similar dance when freeing the memory. kmem_free() already
    does the correct freeing dance, and kmem_zalloc_large() is only ever
    called in these constructs, so just factor it all into
    kmem_zalloc_large() and kmem_free().

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
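The factored pattern can be sketched in userspace as below. Everything here is a stand-in: `toy_contig_zalloc` pretends that physically contiguous allocations above an arbitrary limit fail, `toy_virt_zalloc` plays the role of the virtually mapped fallback, and both are backed by `calloc`.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_CONTIG_LIMIT 4096u /* assumed: larger contiguous requests fail */

/* Stand-in for the contiguous allocator, failing on large requests. */
static void *toy_contig_zalloc(size_t size)
{
    if (size > TOY_CONTIG_LIMIT)
        return NULL;
    return calloc(1, size);
}

/* Stand-in for the virtually mapped fallback allocator. */
static void *toy_virt_zalloc(size_t size)
{
    return calloc(1, size);
}

/* One helper owns the try-then-fall-back dance, so callers do not repeat it. */
static void *toy_zalloc_large(size_t size)
{
    void *p = toy_contig_zalloc(size);

    if (p)
        return p;
    return toy_virt_zalloc(size);
}

/*
 * Single free entry point. In this sketch both allocators use calloc(),
 * so free() is enough; the kernel version has to tell the two apart.
 */
static void toy_free(void *p)
{
    free(p);
}
```

Callers then reduce to a single `toy_zalloc_large()` and `toy_free()` pair regardless of which allocator satisfied the request.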
     
  • Ever since increasing the number of supported ACLs from 25 to as
    many as can fit in an xattr, there have been reports of order 4
    memory allocations failing in the ACL code. Fix it in the same way
    we've fixed all the xattr read/write code that has the same problem.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • When splitting the root of the da btree, we shuffled data between
    buffers and the structures that track them. At one point, we copy
    data and state from one buffer to another, including the ops
    associated with the buffer. When we do this, we also need to copy
    the buffer type associated with the buf log item so that the buffer
    is logged correctly. If we don't do that, log recovery won't
    recognise it and hence it won't recalculate the CRC on the buffer
    after recovery. This leads to a directory block that can't be read
    after recovery has run.

    Found by inspection after finding the same problem with remote
    symlink buffers.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The logging of a remote symlink block does not set the buffer type
    being logged, and hence on recovery the type of buffer is not
    recognised and hence CRCs are not calculated after replay. This
    results in log recovery throwing:

    XFS (vdc): Unknown buffer type 0

    errors, and subsequent reads of the symlink failing CRC
    verification. Found via fsstress + godown.

    Reported-by: Michael L. Semon
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • This is the recovery side of the btree block owner change operation
    performed by swapext on CRC enabled filesystems. We detect that an
    owner change is needed by the flag that has been placed on the inode
    log format flag field. Because the inode recovery is being replayed
    after the buffers that make up the BMBT in the given checkpoint, we
    can walk all the buffers and directly modify them when we see the
    flag set on an inode.

    Because the inode can be relogged and hence present in multiple
    checkpoints with the "change owner" flag set, we could do multiple
    passes across the inode to do this change. While this isn't optimal,
    we can't directly ignore the flag as there may be multiple
    independent swap extent operations being replayed on the same inode
    in different checkpoints so we can't ignore them.

    Further, because the owner change operation uses ordered buffers, we
    might have buffers that are newer on disk than the current
    checkpoint and so already have the owner changed in them. Hence we
    cannot just peek at a buffer in the tree and check that it has the
    correct owner and assume that the change was completed.

    So, for the moment just brute force the owner change every time we
    see an inode with the flag set. Note that we have to be careful here
    because the owner of the buffers may point to either the old owner
    or the new owner. Currently the verifier can't verify the owner
    directly, so there is no failure case here right now. If we verify
    the owner exactly in future, then we'll have to take this into
    account.

    This was tested in terms of normal operation via xfstests - all of
    the fsr tests now pass without failure. However, we really need to
    modify xfs/227 to stress v3 inodes correctly to ensure we fully
    cover this case for v5 filesystems.

    In terms of recovery testing, I used a hacked version of xfs_fsr
    that held the temp inode open for a few seconds before exiting so
    that the filesystem could be shut down with an open owner change
    recovery flags set on at least the temp inode. fsr leaves the temp
    inode unlinked and in btree format, so this was necessary for the
    owner change to be reliably replayed.

    logprint confirmed the tmp inode in the log had the correct flag set:

    INO: cnt:3 total:3 a:0x69e9e0 len:56 a:0x69ea20 len:176 a:0x69eae0 len:88
    INODE: #regs:3 ino:0x44 flags:0x209 dsize:88
    ^^^^^

    0x200 is set, indicating a data fork owner change needed to be
    replayed on inode 0x44. A printk in the recovery code confirmed that
    the inode change was recovered:

    XFS (vdc): Mounting Filesystem
    XFS (vdc): Starting recovery (logdev: internal)
    recovering owner change ino 0x44
    XFS (vdc): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
    Use of these features in this kernel is at your own risk!
    XFS (vdc): Ending recovery (logdev: internal)

    The script used to test this was:

    $ cat ./recovery-fsr.sh
    #!/bin/bash

    dev=/dev/vdc
    mntpt=/mnt/scratch
    testfile=$mntpt/testfile

    umount $mntpt
    mkfs.xfs -f -m crc=1 $dev
    mount $dev $mntpt
    chmod 777 $mntpt

    for i in `seq 10000 -1 0`; do
        xfs_io -f -d -c "pwrite $(($i * 4096)) 4096" $testfile > /dev/null 2>&1
    done
    xfs_bmap -vp $testfile |head -20

    xfs_fsr -d -v $testfile &
    sleep 10
    /home/dave/src/xfstests-dev/src/godown -f $mntpt
    wait
    umount $mntpt

    xfs_logprint -t $dev |tail -20
    time mount $dev $mntpt
    xfs_bmap -vp $testfile
    umount $mntpt
    $

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
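The "brute force" replay described above amounts to an idempotent rewrite: every block in the fork is stamped with the new owner, whichever owner it currently carries. A minimal sketch, with a hypothetical `toy_bmbt_block` standing in for the on-disk btree block:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical btree block: only the owner field matters here. */
struct toy_bmbt_block {
    uint64_t owner;
};

/*
 * Stamp every block with new_owner, without trusting the current value,
 * since a block may already have been rewritten by a later checkpoint.
 * Returns how many blocks actually changed.
 */
static size_t toy_recover_owner_change(struct toy_bmbt_block *blocks,
                                       size_t nblocks, uint64_t new_owner)
{
    size_t rewritten = 0;

    for (size_t i = 0; i < nblocks; i++) {
        if (blocks[i].owner != new_owner)
            rewritten++;
        blocks[i].owner = new_owner;
    }
    return rewritten;
}
```

Because the operation is idempotent, replaying it once per checkpoint that carries the flag is wasteful but safe.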
     

10 Sep, 2013

5 commits

  • For CRC enabled filesystems, we can't just swap inode forks from one
    inode to another when defragmenting a file - the blocks in the inode
    fork bmap btree contain pointers back to the owner inode. Hence if
    we are to swap the inode forks we have to atomically modify every
    block in the btree during the transaction.

    We are doing an entire fork swap here, so we could create a new
    transaction item type that indicates we are changing the owner of a
    certain structure from one value to another. If we combine this with
    ordered buffer logging to modify all the buffers in the tree, then
    we can change the buffers in the tree without needing log space for
    the operation. However, this then requires log recovery to perform
    the modification of the owner information of the objects/structures
    in question.

    This does introduce some interesting ordering details into recovery:
    we have to make sure that the owner change replay occurs after the
    change that moves the objects is made, not before. Hence we can't
    use a separate log item for this as we have no guarantee of strict
    ordering between multiple items in the log due to the relogging
    action of asynchronous transaction commits. Hence there is no
    "generic" method we can use for changing the ownership of arbitrary
    metadata structures.

    For inode forks, however, there is a simple method of communicating
    that the fork contents need the owner rewritten - we can pass an
    inode log format flag for the fork for the transaction that does a
    fork swap. This flag will then follow the inode fork through
    relogging actions so when the swap actually gets replayed the
    ownership can be changed immediately by log recovery. So that gives
    us a simple method of "whole fork" exchange between two inodes.

    This is relatively simple to implement, so it makes sense to do this
    as an initial implementation to support xfs_fsr on CRC enabled
    filesystems in the same manner as we do on existing filesystems. This
    commit introduces the swapext driven functionality, the recovery
    functionality will be in a separate patch.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Calling xfs_dir3_leaf_hdr_from_disk() in a verifier before
    validating the magic numbers in the buffer results in ASSERT
    failures due to mismatching magic numbers when a corruption occurs.
    Seeing as the verifier is supposed to catch the corruption and pass
    it back to the caller, having the verifier assert fail on error
    defeats the purpose of detecting the errors in the first place.

    Check the magic numbers direct from the buffer before decoding the
    header.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
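The ordering described above, checking the raw magic number before decoding anything else, can be sketched as below. The magic value, the `magic_off` parameter and the function names are illustrative assumptions, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LEAF_MAGIC 0xd2f1u /* illustrative magic value */

/* Read a big-endian 16-bit value straight from the raw buffer. */
static uint16_t toy_get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Verifier sketch: report corruption to the caller instead of asserting.
 * The magic number is compared directly from the buffer, before any
 * header decoding that assumes a valid block type.
 */
static bool toy_leaf_verify(const uint8_t *buf, size_t len, size_t magic_off)
{
    if (len < magic_off + 2)
        return false;
    if (toy_get_be16(buf + magic_off) != TOY_LEAF_MAGIC)
        return false;

    /* Only now would it be safe to decode the rest of the header. */
    return true;
}
```

A corrupted block then produces a false return for the caller to handle instead of tripping an assert inside the decode helper.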
     
  • A couple of simple locking annotations and 0 vs NULL warnings.
    Nothing that changes any code behaviour, just removes build noise.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • sparse reports:

    fs/xfs/xfs_log_recover.c:2017:24: sparse: cast to restricted __be64

    Because I used the wrong structure for the on-disk superblock cast
    in 50d5c8d ("xfs: check LSN ordering for v5 superblocks during
    recovery"). Fix it.

    Reported-by: kbuild test robot
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Pull xfs updates from Ben Myers:
    "For 3.12-rc1 there are a number of bugfixes in addition to work to
    ease usage of shared code between libxfs and the kernel, the rest of
    the work to enable project and group quotas to be used simultaneously,
    performance optimisations in the log and the CIL, directory entry file
    type support, fixes for log space reservations, some spelling/grammar
    cleanups, and the addition of user namespace support.

    - introduce readahead to log recovery
    - add directory entry file type support
    - fix a number of spelling errors in comments
    - introduce new Q_XGETQSTATV quotactl for project quotas
    - add USER_NS support
    - log space reservation rework
    - CIL optimisations
    - kernel/userspace libxfs rework"

    * tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs: (112 commits)
    xfs: XFS_MOUNT_QUOTA_ALL needed by userspace
    xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino
    Fix wrong flag ASSERT in xfs_attr_shortform_getvalue
    xfs: finish removing IOP_* macros.
    xfs: inode log reservations are too small
    xfs: check correct status variable for xfs_inobt_get_rec() call
    xfs: inode buffers may not be valid during recovery readahead
    xfs: check LSN ordering for v5 superblocks during recovery
    xfs: btree block LSN escaping to disk uninitialised
    XFS: Assertion failed: first < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
    xfs: fix bad dquot buffer size in log recovery readahead
    xfs: don't account buffer cancellation during log recovery readahead
    xfs: check for underflow in xfs_iformat_fork()
    xfs: xfs_dir3_sfe_put_ino can be static
    xfs: introduce object readahead to log recovery
    xfs: Simplify xfs_ail_min() with list_first_entry_or_null()
    xfs: Register hotcpu notifier after initialization
    xfs: add xfs sb v4 support for dirent filetype field
    xfs: Add write support for dirent filetype field
    xfs: Add read-only support for dirent filetype field
    ...

    Linus Torvalds
     

04 Sep, 2013

3 commits

  • Add support to the core direct-io code to defer AIO completions to user
    context using a workqueue. This replaces opencoded and less efficient
    code in XFS and ext4 (we save a memory allocation for each direct IO)
    and will be needed to properly support O_(D)SYNC for AIO.

    The communication between the filesystem and the direct I/O code requires
    a new buffer head flag, which is a bit ugly but not avoidable until the
    direct I/O code stops abusing the buffer_head structure for communicating
    with the filesystems.

    Currently this creates a per-superblock unbound workqueue for these
    completions, which is taken from an earlier patch by Jan Kara. I'm
    not really convinced about this use and would prefer a "normal" global
    workqueue with a high concurrency limit, but this needs further discussion.

    JK: Fixed ext4 part, dynamic allocation of the workqueue.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • So move it to a header file shared with userspace.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • So fix up the export in xfs_dir2.h that is needed by userspace.

    Now xfs_dir3_sfe_put_ino has been made static. Revert 98f7462 ("xfs:
    xfs_dir3_sfe_put_ino can be static") to being non static so that the
    code shared with userspace is identical again.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner