13 Sep, 2017

1 commit

  • If using a kernel with CONFIG_XFS_RT=y and we set the RTINHERIT flag on
    a directory in a filesystem that does not have a realtime device and
    create a new file in that directory, it gets marked as a real time file.
    When data is written and a fsync is issued, the filesystem attempts to
    flush a non-existent rt device during the fsync process.

    This results in a crash dereferencing a null buftarg pointer in
    xfs_blkdev_issue_flush():

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: xfs_blkdev_issue_flush+0xd/0x20
    .....
    Call Trace:
    xfs_file_fsync+0x188/0x1c0
    vfs_fsync_range+0x3b/0xa0
    do_fsync+0x3d/0x70
    SyS_fsync+0x10/0x20
    do_syscall_64+0x4d/0xb0
    entry_SYSCALL64_slow_path+0x25/0x25

    Setting RT inode flags does not require special privileges so any
    unprivileged user can cause this oops to occur. To reproduce, confirm
    kernel is compiled with CONFIG_XFS_RT=y and run:

    # mkfs.xfs -f /dev/pmem0
    # mount /dev/pmem0 /mnt/test
    # mkdir /mnt/test/foo
    # xfs_io -c 'chattr +t' /mnt/test/foo
    # xfs_io -f -c 'pwrite 0 5m' -c fsync /mnt/test/foo/bar

    Or just run xfstests with MKFS_OPTIONS="-d rtinherit=1" and wait.

    Kernels built with CONFIG_XFS_RT=n are not exposed to this bug.
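
The failure mode described above can be sketched in userspace C. This is only an illustration under stated assumptions: `demo_mount`, `demo_buftarg`, `demo_flush_target` and `demo_fsync` are hypothetical names, not XFS code, and falling back to the data device is one possible guard, not necessarily the change made by this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a buffer target and a mount; not XFS types. */
struct demo_buftarg { int flush_count; };

struct demo_mount {
    struct demo_buftarg *data_targ;  /* always present */
    struct demo_buftarg *rt_targ;    /* NULL when there is no realtime device */
};

/*
 * Pick the device to flush for an inode. An inode flagged realtime on a
 * filesystem without an rt device falls back to the data device instead of
 * dereferencing a NULL target.
 */
static struct demo_buftarg *demo_flush_target(struct demo_mount *mp, bool rt_inode)
{
    assert(mp != NULL && mp->data_targ != NULL);
    if (rt_inode && mp->rt_targ != NULL)
        return mp->rt_targ;
    return mp->data_targ;
}

/* Model of the fsync path: flush whichever target was selected. */
static void demo_fsync(struct demo_mount *mp, bool rt_inode)
{
    demo_flush_target(mp, rt_inode)->flush_count++;
}
```

With `rt_targ` left NULL, calling `demo_fsync(&mp, true)` flushes the data target rather than crashing, which is the property the reproducer above checks for.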

    Fixes: f538d4da8d52 ("[XFS] write barrier support")
    Cc:
    Signed-off-by: Richard Wareing
    Signed-off-by: Dave Chinner
    Signed-off-by: Linus Torvalds

    Richard Wareing
     

11 Jul, 2017

1 commit

  • Pull XFS updates from Darrick Wong:
    "Here are some changes for you for 4.13. For the most part it's fixes
    for bugs and deadlock problems, and preparation for online fsck in
    some future merge window.

    - Avoid quotacheck deadlocks

    - Fix transaction overflows when bunmapping fragmented files

    - Refactor directory readahead

    - Allow admin to configure if ASSERT is fatal

    - Improve transaction usage detail logging during overflows

    - Minor cleanups

    - Don't leak log items when the log shuts down

    - Remove double-underscore typedefs

    - Various preparation for online scrubbing

    - Introduce new error injection configuration sysfs knobs

    - Refactor dq_get_next to use extent map directly

    - Fix problems with iterating the page cache for unwritten data

    - Implement SEEK_{HOLE,DATA} via iomap

    - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

    - Don't use MAXPATHLEN to check on-disk symlink target lengths"

    * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: don't crash on unexpected holes in dir/attr btrees
    xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
    xfs: fix contiguous dquot chunk iteration livelock
    xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
    vfs: Add iomap_seek_hole and iomap_seek_data helpers
    vfs: Add page_cache_seek_hole_data helper
    xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
    xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
    xfs: Check for m_errortag initialization in xfs_errortag_test
    xfs: grab dquots without taking the ilock
    xfs: fix semicolon.cocci warnings
    xfs: Don't clear SGID when inheriting ACLs
    xfs: free cowblocks and retry on buffered write ENOSPC
    xfs: replace log_badcrc_factor knob with error injection tag
    xfs: convert drop_writes to use the errortag mechanism
    xfs: remove unneeded parameter from XFS_TEST_ERROR
    xfs: expose errortag knobs via sysfs
    xfs: make errortag a per-mountpoint structure
    xfs: free uncommitted transactions during log recovery
    xfs: don't allow bmap on rt files
    ...

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • XFS has a maximum symlink target length of 1024 bytes; this is a
    holdover from the Irix days. Unfortunately, the constant establishing
    this is 'MAXPATHLEN' and is /not/ the same as the Linux MAXPATHLEN,
    which is 4096.

    The kernel enforces its 1024 byte MAXPATHLEN on symlink targets, but
    xfsprogs picks up the (Linux) system 4096 byte MAXPATHLEN, which means
    that xfs_repair doesn't complain about oversized symlinks.

    Since this is an on-disk format constraint, put the define in the XFS
    namespace and move everything over to use the new name.
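
A minimal userspace sketch of a length check against the on-disk limit follows. The `DEMO_` prefix and the helper name are hypothetical, and treating a target of exactly 1024 bytes as valid is an assumption taken from the wording above, not from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* On-disk limit from the commit message; the name mirrors the new define. */
#define DEMO_XFS_SYMLINK_MAXLEN 1024

/* Accept only non-empty targets that fit the on-disk limit (assumed inclusive). */
static bool demo_symlink_target_ok(const char *target)
{
    assert(target != NULL);
    size_t len = strlen(target);
    return len > 0 && len <= DEMO_XFS_SYMLINK_MAXLEN;
}
```

The point of a filesystem-specific constant is that a tool built against the 4096-byte system MAXPATHLEN would accept targets that this check rejects.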

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

20 Jun, 2017

1 commit

  • This is a purely mechanical patch that removes the private
    __{u,}int{8,16,32,64}_t typedefs in favor of using the system
    {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform
    the transformation and fix the resulting whitespace and indentation
    errors:

    s/typedef\t__uint8_t/typedef __uint8_t\t/g
    s/typedef\t__uint/typedef __uint/g
    s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
    s/__uint8_t\t/__uint8_t\t\t/g
    s/__uint/uint/g
    s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
    s/__int/int/g
    /^typedef.*int[0-9]*_t;$/d

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

05 Jun, 2017

3 commits

  • Use the common helper uuid_is_null() and remove the xfs specific
    helper uuid_is_nil().

    The common helper does not check for a NULL pointer as the xfs
    helper did, but xfs code never calls the helper with a pointer
    that can be NULL.

    Conform comments and warning strings to use the term 'null uuid'
    instead of 'nil uuid', because this is the terminology used by
    lib/uuid.c and its users. It is also the terminology used in
    userspace by libuuid and xfsprogs.
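
As a rough illustration of what a "null uuid" test amounts to, here is a userspace sketch. `demo_uuid_t` and `demo_uuid_is_null` are hypothetical names, not the kernel's `uuid_t` or `uuid_is_null()`; the debug-only assert stands in for the observation that callers never pass NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte UUID container for illustration. */
typedef struct { uint8_t b[16]; } demo_uuid_t;

/* True when every byte is zero. The caller must pass a valid pointer. */
static bool demo_uuid_is_null(const demo_uuid_t *uuid)
{
    static const demo_uuid_t null_uuid; /* static storage is zero-initialized */

    assert(uuid != NULL); /* debug check only; no runtime NULL handling */
    return memcmp(uuid, &null_uuid, sizeof(*uuid)) == 0;
}
```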

    Signed-off-by: Amir Goldstein
    [hch: remove now unused uuid.[ch]]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Andy Shevchenko

    Amir Goldstein
     
  • Our "little endian" UUID really is a Wintel GUID, so rename it and its
    helpers accordingly (guid_t). The big endian UUID is the only true one, so
    give it the name uuid_t. The uuid_le and uuid_be names are retained for
    now, but will hopefully go away soon. The exception to that are the _cmp
    helpers that will be replaced by better primitives ASAP and thus don't
    get the new names.

    Also the _to_bin helpers are named to match the better named uuid_parse
    routine in userspace.

    Also remove the existing typedef in XFS that's now been superseded by
    the generic type name.

    Signed-off-by: Christoph Hellwig
    [andy: also update the UUID_LE/UUID_BE macros including fallout]
    Signed-off-by: Andy Shevchenko
    Reviewed-by: Amir Goldstein
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Andy Shevchenko

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • Use the generic Linux definition to implement our UUID type; this will
    allow using more generic infrastructure in the future.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Amir Goldstein
    Reviewed-by: Brian Foster
    Reviewed-by: Andy Shevchenko
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     

12 Apr, 2017

1 commit

  • Long ago, all this gunk was added with a lament about problems
    with gcc's do_div, and a fun recommendation in the changelog:

    egcs-2.91.66 is the recommended compiler version for building XFS.

    All this special stuff was needed to work around an old gcc bug,
    apparently, and it's been there ever since.

    There should be no need for this anymore, so remove it.

    Remove the special 32-bit xfs_do_mod as well; just let the
    kernel's do_div() handle all this.
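
For readers unfamiliar with the calling convention, the kernel's do_div() divides a 64-bit value in place by a 32-bit base and yields the remainder. The userspace sketch below mimics that behaviour with ordinary functions; `demo_do_div` and `demo_do_mod` are hypothetical names, and the real do_div() is a macro operating on an lvalue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-in for do_div() semantics: divide *n in place and return
 * the remainder.
 */
static uint32_t demo_do_div(uint64_t *n, uint32_t base)
{
    assert(n != NULL && base != 0);
    uint32_t rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}

/* A modulo helper built on top of it; the caller's value is left untouched. */
static uint32_t demo_do_mod(uint64_t n, uint32_t base)
{
    return demo_do_div(&n, base);
}
```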

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     

02 Mar, 2017

1 commit


18 Jan, 2017

1 commit


25 Dec, 2016

1 commit


07 Dec, 2016

1 commit

  • On filesystems with a lot of metadata and in metadata-intensive workloads,
    xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
    the CPU time is spent on CPU cache misses while traversing the rbtree.

    As the buffer cache does not need any kind of ordering, only fast
    lookups, a hashtable is the natural data structure to use. The rhashtable
    infrastructure provides a self-scaling hashtable implementation and
    allows lookups to proceed while the table is going through a resize
    operation.

    This reduces the CPU-time spent for the lookups to 1/3 even for small
    filesystems with a relatively small number of cached buffers, with
    possibly much larger gains on higher loaded filesystems.
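
The lookup pattern can be sketched with a small chained hash table keyed by block number. This is a fixed-size userspace illustration with hypothetical names (`demo_cache`, `demo_buf`); it is not the rhashtable API and does not resize itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BUCKETS 64  /* fixed size; rhashtable resizes itself instead */

struct demo_buf {
    uint64_t blkno;          /* lookup key */
    struct demo_buf *next;   /* chain within one bucket */
};

struct demo_cache { struct demo_buf *bucket[DEMO_BUCKETS]; };

static size_t demo_hash(uint64_t blkno)
{
    /* Multiplicative hash; the top 6 bits select one of 64 buckets. */
    return (size_t)((blkno * 0x9E3779B97F4A7C15ULL) >> 58);
}

static void demo_cache_insert(struct demo_cache *c, struct demo_buf *bp)
{
    assert(c != NULL && bp != NULL);
    size_t i = demo_hash(bp->blkno);
    bp->next = c->bucket[i];
    c->bucket[i] = bp;
}

/* Only one bucket's chain is walked; there is no ordered tree descent. */
static struct demo_buf *demo_cache_find(const struct demo_cache *c, uint64_t blkno)
{
    for (struct demo_buf *bp = c->bucket[demo_hash(blkno)]; bp; bp = bp->next)
        if (bp->blkno == blkno)
            return bp;
    return NULL;
}
```

Because no ordering is maintained, a lookup touches one bucket head and a short chain, which is where the reduction in cache misses compared with an rbtree walk would be expected to come from.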

    [dchinner: reduce minimum hash size to an acceptable size for large
    filesystems with many AGs with no active use.]
    [dchinner: remove stale rbtree asserts.]
    [dchinner: use xfs_buf_map for compare function argument.]
    [dchinner: make functions static.]
    [dchinner: remove redundant comments.]

    Signed-off-by: Lucas Stach
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Lucas Stach
     

06 Oct, 2016

1 commit

  • Trim CoW reservations made on behalf of a cowextsz hint if they get too
    old or we run low on quota, so long as we don't have dirty data awaiting
    writeback or directio operations in progress.

    Garbage collection of the cowextsize extents is kept separate from
    prealloc extent reaping because setting the CoW prealloc lifetime to a
    (much) higher value than the regular prealloc extent lifetime has been
    useful for combatting CoW fragmentation on VM hosts where the VMs
    experience bursty write behaviors and we can keep the utilization ratios
    low enough that we don't start to run out of space. IOWs, it benefits
    us to keep the CoW fork reservations around for as long as we can unless
    we run out of blocks or hit inode reclaim.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

20 Jul, 2016

1 commit

  • Instead we always declare struct xfs_dir2_sf_hdr as packed. That's
    the expected layout, and while most major architectures do the packing
    by default the new structure size and offset checker showed that not
    only the ARM old ABI got this wrong, but various minor embedded
    architectures did as well.

    [Verified that no code change on x86-64 results from this change]
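
The effect of packing can be illustrated with a header made only of byte-sized fields plus compile-time layout checks. The struct and field names below are assumptions for illustration, and `__attribute__((packed))` is the GCC/Clang spelling rather than the kernel's `__packed` macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative header of byte-sized fields. Without the attribute, some ABIs
 * round the struct size up to a multiple of 4.
 */
struct demo_sf_hdr {
    uint8_t count;
    uint8_t i8count;
    uint8_t parent[8];
} __attribute__((packed));

/* Compile-time checks in the spirit of a structure size and offset checker. */
static_assert(sizeof(struct demo_sf_hdr) == 10, "unexpected header size");
static_assert(offsetof(struct demo_sf_hdr, parent) == 2, "unexpected offset");
```

On architectures that already lay the struct out this way, the attribute changes nothing, which matches the x86-64 note above.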

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Oct, 2015

1 commit

  • This patch is the next step toward per-fs xfs stats. The patch makes
    the show and clear routines able to handle any stats structure
    associated with a kobject.

    Instead of a single global xfsstats structure, add kobject and a pointer
    to a per-cpu struct xfsstats. Modify the macros that manipulate the stats
    accordingly: XFS_STATS_INC, XFS_STATS_DEC, and XFS_STATS_ADD now access
    xfsstats->xs_stats.

    The sysfs functions need to get from the kobject back to the xfsstats
    structure which contains it, and pass the pointer to the ->xs_stats
    percpu structure into the show & clear routines.
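
Getting from an embedded kobject back to its containing structure is the usual container_of() pattern. The userspace sketch below shows the pointer arithmetic with hypothetical stand-in types; the per-cpu aspect is left out, and none of the `demo_` names are the actual XFS definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_kobject { const char *name; };          /* stand-in for struct kobject */
struct demo_stats_counters { unsigned long xs_writes; };

/* Hypothetical wrapper: a pointer to the counters plus an embedded kobject. */
struct demo_xstats {
    struct demo_stats_counters *xs_stats;
    struct demo_kobject xs_kobj;
};

/* What a sysfs show routine would do: kobject -> wrapper -> counters. */
static unsigned long demo_stats_show(struct demo_kobject *kobj)
{
    struct demo_xstats *xs = demo_container_of(kobj, struct demo_xstats, xs_kobj);

    assert(xs->xs_stats != NULL);
    return xs->xs_stats->xs_writes;
}
```

Since the show routine only needs the kobject to find its counters, the same routine can serve any stats structure that embeds one.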

    Signed-off-by: Bill O'Donnell
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Bill O'Donnell
     

22 Jun, 2015

3 commits


23 Feb, 2015

1 commit

  • Now that the in-core superblock infrastructure has been replaced with
    generic per-cpu counters, we don't need it anymore. Nuke it from
    orbit so we are sure that it won't haunt us again...

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

28 Nov, 2014

1 commit

  • More consolidation for the on-disk format definitions. Note that the
    XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related
    to the on disk format, but depends on a CONFIG_ option.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

02 Oct, 2014

1 commit

  • The typedef for timespecs and nanotime() are completely unnecessary,
    and delay() can be moved to fs/xfs/linux.h, which means this file
    can go away.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

04 Aug, 2014

3 commits


30 Jul, 2014

1 commit

  • Trying to support tiny disks only and saving a bit of memory might have
    made sense on an SGI O2 15 years ago, but is pretty pointless today.

    Remove the rarely tested codepath that uses various smaller in-memory
    types to reduce our test matrix and make the codebase a little bit
    smaller and less complicated.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ben Myers
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

15 Jul, 2014

1 commit

  • Embed a base kobject into xfs_mount. This creates a kobject associated
    with each XFS mount and a subdirectory in sysfs with the name of the
    filesystem. The subdirectory lifecycle matches that of the mount. Also
    add the new xfs_sysfs.[c,h] source files with some XFS sysfs
    infrastructure to facilitate attribute creation.

    Note that there are currently no attributes exported as part of the
    xfs_mount kobject. It exists solely to serve as a per-mount container
    for child objects.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

22 Jun, 2014

1 commit

  • return is not a function. "return(EIO);" is silly;
    "return (EIO);" moreso. return is not a function.
    Nuke the pointless parens.

    [dchinner: catch a couple of extra cases in xfs_attr_list.c,
    xfs_acl.c and xfs_linux.h.]

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

27 Feb, 2014

2 commits

  • We want to distinguish between corruption, CRC errors,
    etc. In addition, the full stack trace on verifier errors
    seems less than helpful; it looks more like an oops than
    corruption.

    Create a new function to specifically alert the user to
    verifier errors, which can differentiate between
    EFSCORRUPTED and CRC mismatches. It doesn't dump stack
    unless the xfs error level is turned up high.

    Define a new error code (EFSBADCRC) to clearly identify
    CRC errors. (Defined as EBADMSG, "bad message".)
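
A small sketch of how the two failure classes might be told apart by error code follows. The EBADMSG mapping is the one stated above; mapping corruption to the Linux-specific EUCLEAN is an assumption of this sketch, and the `DEMO_` names and the label helper are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define DEMO_EFSBADCRC    EBADMSG  /* mapping stated in the commit message */
#define DEMO_EFSCORRUPTED EUCLEAN  /* Linux-specific errno; assumed mapping */

/* Return a short label so the two failure classes are reported differently. */
static const char *demo_verifier_error_label(int error)
{
    assert(error != 0);
    switch (error) {
    case DEMO_EFSBADCRC:
        return "CRC error";
    case DEMO_EFSCORRUPTED:
        return "corruption";
    default:
        return "other error";
    }
}
```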

    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • Many/most callers of xfs_verify_cksum() pass bp->b_addr and
    BBTOB(bp->b_length) as the first 2 args. Add a helper
    which can just accept the bp and the crc offset, and work
    it out on its own, for brevity.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

21 Aug, 2013

1 commit


16 Aug, 2013

1 commit


13 Aug, 2013

1 commit

  • xfs_types.h is shared with userspace, so having kernel specific
    types defined in it is problematic. Move all the kernel specific
    defines to xfs_linux.h so we can remove the __KERNEL__ guards from
    xfs_types.h.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

08 May, 2013

1 commit

  • Running a CONFIG_XFS_DEBUG kernel in production environments is not
    the best idea as it introduces significant overhead, can change
    the behaviour of algorithms (such as allocation) to improve test
    coverage, and (most importantly) panic the machine on non-fatal
    errors.

    There are many cases where all we want to do is run a
    kernel with more bounds checking enabled, such as is provided by the
    ASSERT() statements throughout the code, but without all the
    potential overhead and drawbacks.

    This patch converts all the ASSERT statements to evaluate as
    WARN_ON(1) statements and hence if they fail dump a warning and a
    stack trace to the log. This has minimal overhead and does not
    change any algorithms, and will allow us to find strange "out of
    bounds" problems more easily on production machines.

    There are a few places where assert statements contain debug only
    code. These are converted to be debug-or-warn only code so that we
    still get all the assert checks in the code.
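
The warn-instead-of-panic idea can be sketched in userspace as an assertion macro that reports the failure and keeps running. `DEMO_ASSERT`, `demo_assfail` and `demo_warn_count` are hypothetical names; the kernel version would emit a warning and stack trace to the log rather than print to stderr.

```c
#include <stdio.h>

static int demo_warn_count;  /* number of failed checks seen so far */

/* Report a failed check and keep running, loosely like WARN_ON(1). */
static void demo_assfail(const char *expr, const char *file, int line)
{
    fprintf(stderr, "Assertion failed: %s, file: %s, line: %d\n",
            expr, file, line);
    demo_warn_count++;
}

/* Warn-only flavour of ASSERT(): the expression is always evaluated. */
#define DEMO_ASSERT(expr) \
    ((expr) ? (void)0 : demo_assfail(#expr, __FILE__, __LINE__))
```

A failing `DEMO_ASSERT(x < limit)` leaves a message and bumps the counter, and execution continues on the same path it would have taken without the check.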

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Ben Myers

    Dave Chinner
     

04 Apr, 2013

1 commit

  • Ratelimited printk is useful for xfs messages that must be printed but
    occur at such a high rate that printing every instance could overflow
    the kernel ring buffer.
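
The general idea of rate limiting can be sketched as a window-and-burst counter. This is a userspace illustration with hypothetical names and an explicit `now` argument, not the kernel's ratelimit implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limiter: at most `burst` messages per `interval` time units. */
struct demo_ratelimit {
    long interval;
    int burst;
    long window_start;
    int printed;
};

/* Return true when the caller may print at time `now`; otherwise suppress. */
static bool demo_ratelimit_ok(struct demo_ratelimit *rl, long now)
{
    assert(rl != NULL && rl->interval > 0 && rl->burst > 0);
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;  /* start a new window */
        rl->printed = 0;
    }
    if (rl->printed >= rl->burst)
        return false;
    rl->printed++;
    return true;
}
```

A caller would wrap its message in `if (demo_ratelimit_ok(&rl, now))`, so the first few occurrences in each window are logged and the rest are dropped.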

    Signed-off-by: Raghavendra D Prabhu
    Reviewed-by: Rich Johnston
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Rich Johnston
     

20 Nov, 2012

1 commit

  • - add a mount feature bit for CRC enabled filesystems
    - add some helpers for generating and verifying the CRCs
    - add a copy_uuid helper

    The checksumming helpers are loosely based on similar ones in sctp,
    all other bits come from Dave Chinner.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

09 Nov, 2012

1 commit

  • Create a new mount workqueue and delayed_work to enable background
    scanning and freeing of eofblocks inodes. The scanner kicks in once
    speculative preallocation occurs and stops requeueing itself when
    no eofblocks inodes exist.

    The scan interval is based on the new
    'speculative_prealloc_lifetime' tunable (defaults to 5m). The
    background scanner performs unfiltered, best effort scans (which
    skips inodes under lock contention or with a dirty cache mapping).

    Signed-off-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Brian Foster
     

12 Oct, 2011

1 commit

  • Currently we have a few issues with the way the workqueue code is used to
    implement AIL pushing:

    - it accidentally uses the same workqueue as the syncer action, and thus
      can be prevented from running if there are enough sync actions active
      in the system.
    - it doesn't use the HIGHPRI flag to queue at the head of the queue of
      work items

    At this point I'm not confident enough in getting all the workqueue flags and
    tweaks right to provide a perfectly reliable execution context for AIL
    pushing, which is the most important piece in XFS to make forward progress
    when the log fills.

    Revert to using a kthread per filesystem, which fixes all the above issues
    at the cost of having a task struct and stack around for each mounted
    filesystem. In addition, this gives us much better ways to diagnose
    any issues involving hung AIL pushing and removes a small amount of code.

    Signed-off-by: Christoph Hellwig
    Reported-by: Stefan Priebe
    Tested-by: Stefan Priebe
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

13 Aug, 2011

1 commit

  • Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
    annoying subdirectories in the XFS source code. Besides the large
    number of file renames, the only changes are to the Makefile, a few
    files including headers with the subdirectory prefix, and the binary
    sysctl compat code that includes a header under fs/xfs/ from
    kernel/.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig