15 Mar, 2017

1 commit

  • When we're reading or writing the data fork of an inline directory,
    check the contents to make sure we're not overflowing buffers or eating
    garbage data. xfs/348 corrupts an inline symlink into an inline
    directory, triggering a buffer overflow bug.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    ---
    v2: add more checks consistent with _dir2_sf_check and make the verifier
    usable from anywhere.

    Darrick J. Wong
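
    As a rough illustration of the check described above (this is not the
    actual XFS verifier, and the entry layout here is deliberately made up
    and simplified), a standalone sketch of walking a shortform directory
    image without ever reading past the end of the fork buffer:

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical, simplified shortform layout: a 1-byte entry count,
         * then entries of (namelen byte, name, 64-bit inode number).  The
         * real xfs_dir2_sf format has more fields; only the bounds
         * checking idea matters here. */
        static bool sf_dir_check(const uint8_t *fork, size_t fork_size)
        {
                size_t pos = 1;
                unsigned int i, nentries;

                if (fork_size < 1)
                        return false;
                nentries = fork[0];
                for (i = 0; i < nentries; i++) {
                        uint8_t namelen;

                        if (pos + 1 > fork_size)
                                return false;   /* entry header overflows  */
                        namelen = fork[pos];
                        pos += 1 + namelen + 8; /* name + inode number     */
                        if (pos > fork_size)
                                return false;   /* entry spills past fork  */
                }
                return pos == fork_size;        /* and no trailing garbage */
        }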
     

09 Mar, 2017

2 commits

    When a reflink operation causes the bmap code to allocate a btree block,
    we currently do a single-AG allocation because ->firstblock is set, and
    then try any higher AG due to a little reflink quirk we put in when
    adding the reflink code. But given that we hold no minleft reservation
    of any kind in this AG, we can still end up with no space in the same
    or any higher AG even if the file system as a whole has enough free
    space. To fix this, use an XFS_ALLOCTYPE_FIRST_AG allocation in this
    fallback path instead.

    [And yes, we need to redo this properly instead of piling hacks over
    hacks. I'm working on that, but it's not going to be a small series.
    In the meantime this fixes the customer reported issue]

    Also add a warning for failing allocations to make it easier to debug.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Commit fa7f138 ("xfs: clear delalloc and cache on buffered write
    failure") fixed one regression in the iomap error handling code and
    exposed another. The fundamental problem is that if a buffered write
    is a rewrite of preexisting delalloc blocks and the write fails, the
    failure handling code can punch out preexisting blocks with valid
    file data.

    This was reproduced directly by sub-block writes in the LTP
    kernel/syscalls/write/write03 test. A first 100 byte write allocates
    a single block in a file. A subsequent 100 byte write fails and
    punches out the block, including the data successfully written by
    the previous write.

    To address this problem, update the ->iomap_begin() handler to
    distinguish newly allocated delalloc blocks from preexisting
    delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
    ->iomap_end() handler to decide when a failed or short write should
    punch out delalloc blocks.

    This introduces the subtle requirement that ->iomap_begin() should
    never combine newly allocated delalloc blocks with existing blocks
    in the resulting iomap descriptor. This can occur when a new
    delalloc reservation merges with a neighboring extent that is part
    of the current write, for example. Therefore, drop the
    post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
    just return the record inserted into the fork. This ensures only new
    blocks are returned and thus that preexisting delalloc blocks are
    always handled as "found" blocks and not punched out on a failed
    rewrite.

    Reported-by: Xiong Zhou
    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
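
    A toy model of the flag-based decision described above (not the XFS
    code itself; the real logic lives in the xfs_file_iomap_begin/_end
    handlers, but IOMAP_F_NEW carries the same meaning there):

        #include <stdbool.h>
        #include <stddef.h>

        #define IOMAP_F_NEW     0x01    /* mapping required new allocation */

        struct toy_iomap {
                unsigned int flags;
        };

        /* Begin step: mark the mapping "new" only if delalloc blocks had
         * to be reserved for it; a preexisting delalloc extent is treated
         * as found data. */
        static void toy_iomap_begin(struct toy_iomap *m, bool found_delalloc)
        {
                m->flags = 0;
                if (!found_delalloc)
                        m->flags |= IOMAP_F_NEW;
        }

        /* End step: only blocks this write created and then failed to fill
         * are candidates for punching, so a failed rewrite never destroys
         * previously written data. */
        static bool toy_iomap_end_should_punch(const struct toy_iomap *m,
                                               size_t requested, size_t written)
        {
                return (m->flags & IOMAP_F_NEW) && written < requested;
        }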
     

08 Mar, 2017

4 commits

  • The sole remaining caller of kmem_zalloc_greedy is bulkstat, which uses
    it to grab 1-4 pages for staging of inobt records. The infinite loop in
    the greedy allocation function is causing hangs[1] in generic/269, so
    just get rid of the greedy allocator in favor of kmem_zalloc_large.
    This makes bulkstat somewhat more likely to ENOMEM if there are really
    no pages to spare, but it eliminates a source of hangs.

    [1] http://lkml.kernel.org/r/20170301044634.rgidgdqqiiwsmfpj%40XZHOUW.usersys.redhat.com

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    ---
    v2: remove single-page fallback

    Darrick J. Wong
     
  • When block size is larger than inode cluster size, the call to
    XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
    would have set xfs_sb->sb_inoalignmt to 0. Hence in
    xfs_set_inoalignment(), xfs_mount->m_inoalign_mask gets initialized to
    -1 instead of 0. However, xfs_mount->m_sinoalign would get correctly
    initialized to 0 because for every positive value of xfs_mount->m_dalign,
    the condition "!(mp->m_dalign & mp->m_inoalign_mask)" would evaluate to
    false.

    Also, xfs_imap() worked fine even with xfs_mount->m_inoalign_mask having
    -1 as the value because blks_per_cluster variable would have the value 1
    and hence we would never have a need to use xfs_mount->m_inoalign_mask
    to compute the inode chunk's agbno and offset within the chunk.

    Signed-off-by: Chandan Rajendra
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Chandan Rajendra
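
    The -1 mentioned above comes from the usual alignment-to-mask
    conversion underflowing; a two-line illustration with made-up values
    (the variable names mirror the text, this is not the mount code):

        #include <stdio.h>

        int main(void)
        {
                unsigned int sb_inoalignmt = 0;             /* as written by mkfs.xfs */
                unsigned int m_inoalign_mask = sb_inoalignmt - 1;
                                                /* wraps to 0xffffffff, i.e. -1 */
                printf("inode alignment mask = 0x%x\n", m_inoalign_mask);
                return 0;
        }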
     
  • There are two different cases of buffered I/O errors:

    - first we can have an already shut down fs. In that case we should
      skip any on-disk operations and just clean up the append transaction
      if present and destroy the ioend.
    - a real I/O error. In that case we should clean up any lingering COW
      blocks. This gets skipped in the current code and is fixed by this
      patch.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • We only want to reclaim preallocations from our periodic work item.
    Currently this is achieved by looking for a dirty inode, but that check
    is rather fragile. Instead add a flag to xfs_reflink_cancel_cow_* so
    that the caller can ask for just cancelling unwritten extents in the COW
    fork.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    [darrick: fix typos in commit message]
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

04 Mar, 2017

1 commit

  • Pull vfs 'statx()' update from Al Viro.

    This adds the new extended stat() interface that internally subsumes our
    previous stat interfaces, and allows user mode to specify in more detail
    what kind of information it wants.

    It also allows for some explicit synchronization information to be
    passed to the filesystem, which can be relevant for network filesystems:
    is the cached value ok, or do you need open/close consistency, or what?

    From David Howells.

    Andreas Dilger points out that the first version of the extended statx
    interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

    * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    statx: Add a system call to make enhanced file info available

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Add a system call to make extended file information available, including
    file creation and some attribute flags where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicate the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    function.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for but later deemed unnecessary with the
    open-by-handle capability available and caused disagreement as to
    whether it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were too ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabel might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set multiple attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
                    const char *filename,
                    unsigned int flags,
                    unsigned int mask,
                    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
            __s64   tv_sec;
            __s32   tv_nsec;
            __s32   __reserved;
    };

    struct statx {
            __u32   stx_mask;
            __u32   stx_blksize;
            __u64   stx_attributes;
            __u32   stx_nlink;
            __u32   stx_uid;
            __u32   stx_gid;
            __u16   stx_mode;
            __u16   __spare0[1];
            __u64   stx_ino;
            __u64   stx_size;
            __u64   stx_blocks;
            __u64   __spare1[1];
            struct statx_timestamp  stx_atime;
            struct statx_timestamp  stx_btime;
            struct statx_timestamp  stx_ctime;
            struct statx_timestamp  stx_mtime;
            __u32   stx_rdev_major;
            __u32   stx_rdev_minor;
            __u32   stx_dev_major;
            __u32   stx_dev_minor;
            __u64   __spare2[14];
    };

    The defined bits in request_mask and stx_mask are:

    STATX_TYPE Want/got stx_mode & S_IFMT
    STATX_MODE Want/got stx_mode & ~S_IFMT
    STATX_NLINK Want/got stx_nlink
    STATX_UID Want/got stx_uid
    STATX_GID Want/got stx_gid
    STATX_ATIME Want/got stx_atime{,_ns}
    STATX_MTIME Want/got stx_mtime{,_ns}
    STATX_CTIME Want/got stx_ctime{,_ns}
    STATX_INO Want/got stx_ino
    STATX_SIZE Want/got stx_size
    STATX_BLOCKS Want/got stx_blocks
    STATX_BASIC_STATS [The stuff in the normal stat struct]
    STATX_BTIME Want/got stx_btime{,_ns}
    STATX_ALL [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided and __spare*[] are where as-yet undefined fields can be
    placed.

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and are the same numerical value:

    STATX_ATTR_COMPRESSED File is compressed by the fs
    STATX_ATTR_IMMUTABLE File is marked immutable
    STATX_ATTR_APPEND File is append-only
    STATX_ATTR_NODUMP File is not to be dumped
    STATX_ATTR_ENCRYPTED File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev, otherwise will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
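
    For reference, a minimal standalone caller looks roughly like the
    sketch below. It assumes a libc without a statx() wrapper and
    therefore goes through syscall(2) directly, plus UAPI headers from a
    kernel that has the system call:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <unistd.h>
        #include <fcntl.h>              /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
        #include <sys/syscall.h>        /* __NR_statx                    */
        #include <linux/stat.h>         /* struct statx, STATX_*         */

        int main(int argc, char *argv[])
        {
                const char *path = argc > 1 ? argv[1] : ".";
                struct statx stx;

                if (syscall(__NR_statx, AT_FDCWD, path, AT_SYMLINK_NOFOLLOW,
                            STATX_BASIC_STATS | STATX_BTIME, &stx) == -1) {
                        perror("statx");
                        return 1;
                }
                printf("mask=0x%x size=%llu btime=%lld\n",
                       stx.stx_mask,
                       (unsigned long long)stx.stx_size,
                       (long long)stx.stx_btime.tv_sec);
                return 0;
        }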

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     


28 Feb, 2017

1 commit

    Replace all open-coded 1 << inode->i_blkbits and (1 << inode->i_blkbits)
    uses in the fs branch with the new i_blocksize() helper (sketched after
    this entry).

    This patch also fixes multiple checkpatch warnings: WARNING: Prefer
    'unsigned int' to bare use of 'unsigned'

    Thanks to Andrew Morton for suggesting a more appropriate function
    instead of a macro.

    [geliangtang@gmail.com: truncate: use i_blocksize()]
    Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
    Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Signed-off-by: Geliang Tang
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
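
    The helper itself is tiny; a sketch of its shape (the real definition
    lives in include/linux/fs.h and operates on the kernel's struct inode;
    the toy struct here only exists to make the snippet self-contained):

        struct inode {                  /* toy stand-in for the kernel type */
                unsigned char i_blkbits;
        };

        static inline unsigned int i_blocksize(const struct inode *node)
        {
                return 1U << node->i_blkbits;
        }

        /* Call sites change from the open-coded shift to the helper:
         *
         *      size = 1 << inode->i_blkbits;       before
         *      size = i_blocksize(inode);          after
         */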
     

25 Feb, 2017

3 commits

    Since the introduction of FAULT_FLAG_SIZE to the vm_fault flags, it has
    been somewhat painful to get the flags set and removed at the correct
    locations. More than one kernel oops was introduced due to the
    difficulty of getting the placement correct.

    Remove the flag values and introduce an input parameter to huge_fault
    that indicates the size of the page entry. This makes the code easier
    to trace and should avoid the issues we see with the fault flags where
    removal of the flag was necessary in the fallback paths.

    Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Tested-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
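
    A sketch of the resulting interface (the real definitions live in
    include/linux/mm.h; the operations struct below is a cut-down
    stand-in, not the kernel's vm_operations_struct):

        enum page_entry_size {
                PE_SIZE_PTE = 0,
                PE_SIZE_PMD,
                PE_SIZE_PUD,
        };

        struct vm_fault;                /* opaque for this sketch */

        struct toy_vm_operations {
                /* The page-entry size is now an explicit argument instead
                 * of flag bits stuffed into the vm_fault. */
                int (*huge_fault)(struct vm_fault *vmf,
                                  enum page_entry_size pe_size);
        };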
     
  • Patch series "1G transparent hugepage support for device dax", v2.

    The following series implements support for 1G transparent hugepages on
    x86 for device dax. The bulk of the code was written by Matthew Wilcox a
    while back supporting transparent 1G hugepages for fs DAX. I have
    forward ported the relevant bits to 4.10-rc. The current submission has
    only the necessary code to support device DAX.

    Comments from Dan Williams: So the motivation and intended user of this
    functionality mirrors the motivation and users of 1GB page support in
    hugetlbfs. Given expected capacities of persistent memory devices an
    in-memory database may want to reduce tlb pressure beyond what they can
    already achieve with 2MB mappings of a device-dax file. We have
    customer feedback to that effect as Willy mentioned in his previous
    version of these patches [1].

    [1]: https://lkml.org/lkml/2016/1/31/52

    Comments from Nilesh @ Oracle:

    There are applications which have a process model; and if you assume
    10,000 processes attempting to mmap all the 6TB memory available on a
    server; we are looking at the following:

    processes : 10,000
    memory : 6TB
    pte @ 4k page size: (6TB / 4k) entries * 8 bytes * 10,000 processes = 120,000GB
    pmd @ 2M page size: 120,000GB / 512 = ~240GB
    pud @ 1G page size: 240GB / 512 = ~480MB

    As you can see with 2M pages, this system will use up an exorbitant
    amount of DRAM to hold the page tables; but the 1G pages finally brings
    it down to a reasonable level. Memory sizes will keep increasing; so
    this number will keep increasing.

    An argument can be made to convert the applications from process model
    to thread model, but in the real world that may not be always practical.
    Hopefully this helps explain the use case where this is valuable.

    This patch (of 3):

    In preparation for adding the ability to handle PUD pages, convert
    vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
    vm_fault structure is extended to include a union of the different page
    table pointers that may be needed, and three flag bits are reserved to
    indicate which type of pointer is in the union.

    [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
    Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
    [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
    Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
    Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Dave Jiang
    Signed-off-by: Ross Zwisler
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
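
    The back-of-the-envelope numbers above can be recomputed with a few
    lines of C (page-table bytes are roughly mapped size / page size * 8
    bytes per entry, times the number of processes mapping the region
    privately):

        #include <stdio.h>

        int main(void)
        {
                const double tib = 1024.0 * 1024 * 1024 * 1024;
                const double mapped = 6 * tib;          /* 6TB of device-dax */
                const double procs = 10000;

                double pte = mapped / (4 << 10) * 8 * procs;    /* 4k pages */
                double pmd = pte / 512;                         /* 2M pages */
                double pud = pmd / 512;                         /* 1G pages */

                printf("pte tables: %.0f GB\n", pte / (1 << 30)); /* 120000 */
                printf("pmd tables: %.0f GB\n", pmd / (1 << 30)); /* ~234, quoted as ~240GB above */
                printf("pud tables: %.0f MB\n", pud / (1 << 20)); /* ~469, quoted as ~480MB above */
                return 0;
        }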
     
  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

23 Feb, 2017

4 commits

  • Merge updates from Andrew Morton:
    "142 patches:

    - DAX updates

    - various misc bits

    - OCFS2 updates

    - most of MM"

    * emailed patches from Andrew Morton : (142 commits)
    mm/z3fold.c: limit first_num to the actual range of possible buddy indexes
    mm: fix stray kernel-doc notation
    zram: remove obsolete sysfs attrs
    mm/memblock.c: remove unnecessary log and clean up
    oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
    mm: drop unused argument of zap_page_range()
    mm: drop zap_details::check_swap_entries
    mm: drop zap_details::ignore_dirty
    mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled
    mm: help __GFP_NOFAIL allocations which do not trigger OOM killer
    mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically
    mm: consolidate GFP_NOFAIL checks in the allocator slowpath
    lib/show_mem.c: teach show_mem to work with the given nodemask
    arch, mm: remove arch specific show_mem
    mm, page_alloc: warn_alloc print nodemask
    mm, page_alloc: do not report all nodes in show_mem
    Revert "mm: bail out in shrink_inactive_list()"
    mm, vmscan: consider eligible zones in get_scan_count
    mm, vmscan: cleanup lru size claculations
    mm, vmscan: do not count freed pages as PGDEACTIVATE
    ...

    Linus Torvalds
     
  • Pull xfs updates from Darrick Wong:
    "Here are the XFS changes for 4.11. We aren't introducing any major
    features in this release cycle except for this being the first merge
    window I've managed on my own. :)

    Changes since last update:

    - Various cleanups

    - Livelock fixes for eofblocks scanning

    - Improved input verification for on-disk metadata

    - Fix races in the copy on write remap mechanism

    - Fix buffer io error timeout controls

    - Streamlining of directio copy on write

    - Asynchronous discard support

    - Fix asserts when splitting delalloc reservations

    - Don't bloat bmbt when right shifting extents

    - Inode alignment fixes for 32k block sizes"

    * tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (39 commits)
    xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG
    xfs: simplify xfs_rtallocate_extent
    xfs: tune down agno asserts in the bmap code
    xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment
    xfs: don't reserve blocks for right shift transactions
    xfs: fix len comparison in xfs_extent_busy_trim
    xfs: fix uninitialized variable in _reflink_convert_cow
    xfs: split indlen reservations fairly when under reserved
    xfs: handle indlen shortage on delalloc extent merge
    xfs: resurrect debug mode drop buffered writes mechanism
    xfs: clear delalloc and cache on buffered write failure
    xfs: don't block the log commit handler for discards
    xfs: improve busy extent sorting
    xfs: improve handling of busy extents in the low-level allocator
    xfs: don't fail xfs_extent_busy allocation
    xfs: correct null checks and error processing in xfs_initialize_perag
    xfs: update ctime and mtime on clone destinatation inodes
    xfs: allocate direct I/O COW blocks in iomap_begin
    xfs: go straight to real allocations for direct I/O COW writes
    xfs: return the converted extent in __xfs_reflink_convert_cow
    ...

    Linus Torvalds
     
  • pmd_fault() and related functions really only need the vmf parameter since
    the additional parameters are all included in the vmf struct. Remove the
    additional parameter and simplify pmd_fault() and friends.

    Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Dave Jiang
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Dave Jiang
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     
  • Instead of passing in multiple parameters in the pmd_fault() handler,
    a vmf can be passed in just like a fault() handler. This will simplify
    code and remove the need for the actual pmd fault handlers to allocate a
    vmf. Related functions are also modified to do the same.

    [dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off]
    Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
    Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Dave Jiang
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     


17 Feb, 2017

9 commits

    In various places we currently assert that xfs_bmap_btalloc allocates
    from the same AG as the firstblock value passed in, unless it's either
    NULLAGNO or the dop_low flag is set. But the reflink code does not
    fully follow this convention as it passes in firstblock purely as
    a hint for the allocator without actually having previous allocations
    in the transaction, and without having a minleft check on the current
    AG, leading to the assert firing on a very full and heavily used
    file system. As even the reflink code only allocates from equal or
    higher AGs for now, we can simplify the check to always allow for equal
    or higher AGs.

    Note that we need to eventually split the two meanings of the firstblock
    value. At that point we can also allow the reflink code to allocate
    from any AG instead of limiting it in any way.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • On a ppc64 system, executing generic/256 test with 32k block size gives the following call trace,

    XFS: Assertion failed: args->maxlen > 0, file: /root/repos/linux/fs/xfs/libxfs/xfs_alloc.c, line: 2026

    kernel BUG at /root/repos/linux/fs/xfs/xfs_message.c:113!
    Oops: Exception in kernel mode, sig: 5 [#1]
    SMP NR_CPUS=2048
    DEBUG_PAGEALLOC
    NUMA
    pSeries
    Modules linked in:
    CPU: 2 PID: 19361 Comm: mkdir Not tainted 4.10.0-rc5 #58
    task: c000000102606d80 task.stack: c0000001026b8000
    NIP: c0000000004ef798 LR: c0000000004ef798 CTR: c00000000082b290
    REGS: c0000001026bb090 TRAP: 0700 Not tainted (4.10.0-rc5)
    MSR: 8000000000029032
    CR: 28004428 XER: 00000000
    CFAR: c0000000004ef180 SOFTE: 1
    GPR00: c0000000004ef798 c0000001026bb310 c000000001157300 ffffffffffffffea
    GPR04: 000000000000000a c0000001026bb130 0000000000000000 ffffffffffffffc0
    GPR08: 00000000000000d1 0000000000000021 00000000ffffffd1 c000000000dd4990
    GPR12: 0000000022004444 c00000000fe00800 0000000020000000 0000000000000000
    GPR16: 0000000000000000 0000000043a606fc 0000000043a76c08 0000000043a1b3d0
    GPR20: 000001002a35cd60 c0000001026bbb80 0000000000000000 0000000000000001
    GPR24: 0000000000000240 0000000000000004 c00000062dc55000 0000000000000000
    GPR28: 0000000000000004 c00000062ecd9200 0000000000000000 c0000001026bb6c0
    NIP [c0000000004ef798] .assfail+0x28/0x30
    LR [c0000000004ef798] .assfail+0x28/0x30
    Call Trace:
    [c0000001026bb310] [c0000000004ef798] .assfail+0x28/0x30 (unreliable)
    [c0000001026bb380] [c000000000455d74] .xfs_alloc_space_available+0x194/0x1b0
    [c0000001026bb410] [c00000000045b914] .xfs_alloc_fix_freelist+0x144/0x480
    [c0000001026bb580] [c00000000045c368] .xfs_alloc_vextent+0x698/0xa90
    [c0000001026bb650] [c0000000004a6200] .xfs_ialloc_ag_alloc+0x170/0x820
    [c0000001026bb7c0] [c0000000004a9098] .xfs_dialloc+0x158/0x320
    [c0000001026bb8a0] [c0000000004e628c] .xfs_ialloc+0x7c/0x610
    [c0000001026bb990] [c0000000004e8138] .xfs_dir_ialloc+0xa8/0x2f0
    [c0000001026bbaa0] [c0000000004e8814] .xfs_create+0x494/0x790
    [c0000001026bbbf0] [c0000000004e5ebc] .xfs_generic_create+0x2bc/0x410
    [c0000001026bbce0] [c0000000002b4a34] .vfs_mkdir+0x154/0x230
    [c0000001026bbd70] [c0000000002bc444] .SyS_mkdirat+0x94/0x120
    [c0000001026bbe30] [c00000000000b760] system_call+0x38/0xfc
    Instruction dump:
    4e800020 60000000 7c0802a6 7c862378 3c82ffca 7ca72b78 38841c18 7c651b78
    38600000 f8010010 f821ff91 4bfff94d 60000000 7c0802a6 7c892378

    When block size is larger than inode cluster size, the call to
    XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
    would have set xfs_sb->sb_inoalignmt to 0. This causes
    xfs_ialloc_cluster_alignment() to return 0. Due to this
    args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned
    equivalent of -1 assigned to it. This later causes alloc_len in
    xfs_alloc_space_available() to have a value of 0. In such a scenario
    when args.total is also 0, the assert statement "ASSERT(args->maxlen >
    0);" fails.

    This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in
    xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb().

    Suggested-by: Darrick J. Wong
    Signed-off-by: Chandan Rajendra
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Chandan Rajendra
     
  • The block reservation for the transaction allocated in
    xfs_shift_file_space() is an artifact of the original collapse range
    support. It exists to handle the case where a collapse range occurs,
    the initial extent is left shifted into a location that forms a
    contiguous boundary with the previous extent and thus the extents
    are merged. This code was subsequently refactored and reused for
    insert range (right shift) support.

    If an insert range occurs under low free space conditions, the
    extent at the starting offset is split before the first shift
    transaction is allocated. If the block reservation fails, this
    leaves separate, but contiguous extents around in the inode. While
    not a fatal problem, this is unexpected and will flag a warning on
    subsequent insert range operations on the inode. This problem has been
    reproduced intermittently by generic/270 running against a ramdisk
    device.

    Since right shift does not create new extent boundaries in the
    inode, a block reservation for extent merge is unnecessary. Update
    xfs_shift_file_space() to conditionally reserve fs blocks for left
    shift transactions only. This avoids the warning reproduced by
    generic/270.

    Reported-by: Ross Zwisler
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • The length is now passed by reference, so the assertion has to be updated
    to match the other changes, as pointed out by this W=1 warning:

    fs/xfs/xfs_extent_busy.c: In function 'xfs_extent_busy_trim':
    fs/xfs/xfs_extent_busy.c:356:13: error: ordered comparison of pointer with integer zero [-Werror=extra]

    Fixes: ebf55872616c ("xfs: improve handling of busy extents in the low-level allocator")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Arnd Bergmann
     
    Fix an uninitialized variable.

    Reported-by: Dan Carpenter
    Reviewed-by: Brian Foster
    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     
    Certain workloads that punch holes into speculative preallocation can
    cause delalloc indirect reservation splits when the delalloc extent is
    split in two. If further splits occur, an already short-handed extent
    can be split into two in a manner that leaves zero indirect blocks for
    one of the two new extents. This occurs because the shortage is large
    enough that the xfs_bmap_split_indlen() algorithm completely drains the
    requested indlen of one of the extents before it honors the existing
    reservation.

    This ultimately results in a warning from xfs_bmap_del_extent(). This
    has been observed during file copies of large, sparse files using 'cp
    --sparse=always.'

    To avoid this problem, update xfs_bmap_split_indlen() to explicitly
    apply the reservation shortage fairly between both extents. This smooths
    out the overall indlen shortage and defers the situation where we end up
    with a delalloc extent with zero indlen reservation to extreme
    circumstances.

    Reported-by: Patrick Dung
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
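
    A toy version of the "fair" split described above (not the kernel's
    xfs_bmap_split_indlen(); among other things the real code also handles
    the case where one side's share of the shortage exceeds its own worst
    case):

        /* Split an existing indirect-block reservation 'avail' between the
         * two halves of a split delalloc extent, each of which would like
         * its own worst-case amount.  If the reservation cannot cover both,
         * share the shortage instead of draining one half to zero. */
        static void toy_split_indlen(unsigned int avail,
                                     unsigned int want1, unsigned int want2,
                                     unsigned int *got1, unsigned int *got2)
        {
                unsigned int shortage, s1, s2;

                if (avail >= want1 + want2) {
                        *got1 = want1;
                        *got2 = want2;
                        return;
                }
                shortage = want1 + want2 - avail;
                s1 = shortage / 2;              /* each half absorbs roughly */
                s2 = shortage - s1;             /* half of the shortage      */
                *got1 = want1 > s1 ? want1 - s1 : 0;
                *got2 = want2 > s2 ? want2 - s2 : 0;
        }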
     
  • When a delalloc extent is created, it can be merged with pre-existing,
    contiguous, delalloc extents. When this occurs,
    xfs_bmap_add_extent_hole_delay() merges the extents along with the
    associated indirect block reservations. The expectation here is that the
    combined worst case indlen reservation is always less than or equal to
    the indlen reservation for the individual extents.

    This is not always the case, however, as existing extents can hold less
    than the expected indlen reservation if the extent was previously split
    due to a hole punch. If a new extent merges with such an extent, the total
    indlen requirement may be larger than the sum of the indlen reservations
    held by both extents.

    xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen
    reservation is always available and assigns it to the merged extent
    without consideration for the indlen held by the pre-existing extent. As
    a result, the subsequent xfs_mod_fdblocks() call can attempt an
    unintentional allocation rather than a free (indicated by an ASSERT()
    failure). Further, if the allocation happens to fail in this context,
    the failure goes unhandled and creates a filesystem wide block
    accounting inconsistency.

    Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the
    indlen reservation assigned to the merged extent to the sum of the
    indlen reservations held by each of the individual extents.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
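
    The merge-side rule reduces to a single cap; a sketch (again a toy,
    not xfs_bmap_add_extent_hole_delay() itself):

        /* The merged delalloc extent may never claim more indirect-block
         * reservation than the extents being merged actually hold between
         * them, so the merge can never require an allocation of its own. */
        static unsigned int toy_merged_indlen(unsigned int worst_case_merged,
                                              unsigned int left_held,
                                              unsigned int right_held)
        {
                unsigned int cap = left_held + right_held;

                return worst_case_merged < cap ? worst_case_merged : cap;
        }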
     
  • A debug mode write failure mechanism was introduced to XFS in commit
    801cc4e17a ("xfs: debug mode forced buffered write failure") to
    facilitate targeted testing of delalloc indirect reservation management
    from userspace. This code was subsequently rendered ineffective by the
    move to iomap based buffered writes in commit 68a9f5e700 ("xfs:
    implement iomap based buffered write path"). This likely went unnoticed
    because the associated userspace code had not made it into xfstests.

    Resurrect this mechanism to facilitate effective indlen reservation
    testing from xfstests. The move to iomap based buffered writes relocated
    the hook this mechanism needs to return write failure from XFS to
    generic code. The failure trigger must remain in XFS. Given that
    limitation, convert this from a write failure mechanism to one that
    simply drops writes without returning failure to userspace. Rename all
    "fail_writes" references to "drop_writes" to illustrate the point. This
    is more hacky than preferred, but still triggers the XFS error handling
    behavior required to drive the indlen tests. This is only available in
    DEBUG mode and for testing purposes only.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • The buffered write failure handling code in
    xfs_file_iomap_end_delalloc() has a couple minor problems. First, if
    written == 0, start_fsb is not rounded down and it fails to kill off a
    delalloc block if the start offset is block unaligned. This results in a
    lingering delalloc block and broken delalloc block accounting detected
    at unmount time. Fix this by rounding down start_fsb in the unlikely
    event that written == 0.

    Second, it is possible for a failed overwrite of a delalloc extent to
    leave dirty pagecache around over a hole in the file. This is because it
    is possible to hit ->iomap_end() on write failure before the iomap code
    has attempted to allocate pagecache, and thus has no need to clean it up. If
    the targeted delalloc extent was successfully written by a previous
    write, however, then it does still have dirty pages when ->iomap_end()
    punches out the underlying blocks. This ultimately results in writeback
    over a hole. To fix this problem, unconditionally punch out the
    pagecache from XFS before punching out the associated delalloc range.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
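
    A quick illustration of the rounding part of the fix, with made-up
    numbers (4096-byte blocks and a block-unaligned write offset where
    nothing ended up being written):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long offset = 4100;       /* unaligned start offset */
                unsigned int blkbits = 12;              /* 4096-byte blocks       */

                /* Rounding up starts the punch at block 2 and misses the
                 * delalloc block (block 1) that backs the offset; rounding
                 * down starts it at block 1 and kills the block off. */
                unsigned long long up   = (offset + (1ULL << blkbits) - 1) >> blkbits;
                unsigned long long down = offset >> blkbits;

                printf("round up -> block %llu, round down -> block %llu\n", up, down);
                return 0;
        }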
     

10 Feb, 2017

6 commits

    Rather than block the log commit handler waiting for discards to
    complete, submit the discard requests and use another workqueue to
    release the extents from the extent busy list.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Sort busy extents by the full block number instead of just the AGNO so
    that we can issue consecutive discard requests that the block layer could
    merge (although we'll need additional block layer fixes for fast devices).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Currently we force the log and simply try again if we hit a busy extent,
    but especially with online discard enabled it might take a while after
    the log force for the busy extents to disappear, and we might have
    already completed our second pass.

    So instead we add a new waitqueue and a generation counter to the pag
    structure so that we can do wakeups once we've removed busy extents,
    and we replace the single retry with an unconditional one - after
    all we hold the AGF buffer lock, so no other allocations or frees
    can be racing with us in this AG.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
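
    A userspace analogue of the wakeup scheme described above (the kernel
    uses a waitqueue plus a generation counter on the per-AG structure;
    this sketch uses pthreads to show the same pattern and is not the XFS
    code):

        #include <pthread.h>

        struct toy_pag {
                pthread_mutex_t lock;   /* assume initialised with the usual */
                pthread_cond_t  wait;   /* PTHREAD_*_INITIALIZERs            */
                unsigned int    busy_gen;
        };

        /* Discard/cleanup side: after removing busy extents, bump the
         * generation and wake anyone stuck in the allocator. */
        static void toy_busy_extents_removed(struct toy_pag *pag)
        {
                pthread_mutex_lock(&pag->lock);
                pag->busy_gen++;
                pthread_cond_broadcast(&pag->wait);
                pthread_mutex_unlock(&pag->lock);
        }

        /* Allocator side: having seen generation 'seen', sleep until the
         * busy list has changed rather than spinning on retries. */
        static void toy_wait_for_busy_change(struct toy_pag *pag, unsigned int seen)
        {
                pthread_mutex_lock(&pag->lock);
                while (pag->busy_gen == seen)
                        pthread_cond_wait(&pag->wait, &pag->lock);
                pthread_mutex_unlock(&pag->lock);
        }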
     
  • We don't just need the structure to track busy extents which can be
    avoided with a synchronous transaction, but also to keep track of
    pending discards.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
    If pag cannot be allocated, the current error exit path will trip
    a null pointer dereference when calling xfs_buf_hash_destroy
    with a null pag. Fix this by adding new error exit labels and
    jumping to those accordingly, avoiding the hash destroy and
    unnecessary kmem_free on pag.

    Up to three things need to be properly unwound:

    1) pag memory allocation
    2) xfs_buf_hash_init
    3) radix_tree_insert

    For any given iteration through the loop, any of the above which
    succeed must be unwound for /this/ pag, and then all prior
    initialized pags must be unwound.

    Addresses-Coverity-Id: 1397628 ("Dereference after null check")

    Reported-by: Colin Ian King
    Signed-off-by: Bill O'Donnell
    Reviewed-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Bill O'Donnell
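
    The unwind ordering can be sketched with the usual goto-label pattern
    (hypothetical helper names; this shows the shape of the fix for one
    pag, not the actual xfs_initialize_perag() code, and leaves unwinding
    previously initialized pags to the caller):

        #include <errno.h>

        /* Stand-ins for the three per-AG setup steps and their teardown. */
        void *alloc_pag(void);
        int   hash_init(void *pag);
        void  hash_destroy(void *pag);
        int   radix_insert(void *pag);
        void  free_pag(void *pag);

        static int toy_init_one_pag(void)
        {
                void *pag;
                int error;

                pag = alloc_pag();              /* 1) pag memory allocation */
                if (!pag)
                        return -ENOMEM;         /* nothing to unwind yet    */

                error = hash_init(pag);         /* 2) buffer hash init      */
                if (error)
                        goto out_free_pag;

                error = radix_insert(pag);      /* 3) radix tree insert     */
                if (error)
                        goto out_destroy_hash;

                return 0;

        out_destroy_hash:
                hash_destroy(pag);              /* undo 2 only if it ran    */
        out_free_pag:
                free_pag(pag);                  /* undo 1; pag is non-NULL  */
                return error;
        }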
     
  • We're changing both metadata and data, so we need to update the
    timestamps for clone operations. Dedupe on the other hand does
    not change file data, and only changes invisible metadata so the
    timestamps should not be updated.

    This follows existing btrfs behavior.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    [darrick: remove redundant is_dedupe test]
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

07 Feb, 2017

2 commits

  • Instead of preallocating all the required COW blocks in the high-level
    write code do it inside the iomap code, like we do for all other I/O.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • When we allocate COW fork blocks for direct I/O writes we currently first
    create a delayed allocation, and then convert it to a real allocation
    once we've got the delayed one.

    As there is no good reason for that, this patch instead makes the COW
    allocation path call xfs_bmapi_write directly. The only interesting bits
    are a few tweaks to the low-level allocator to allow for this, most notably
    the need to remove the call to xfs_bmap_extsize_align for the cowextsize
    in xfs_bmap_btalloc - for the existing convert case it's a no-op, but
    for the direct allocation case it would blow up our block reservation
    way beyond what we reserved for the transaction.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig