07 Oct, 2017

3 commits


06 Oct, 2017

1 commit

  • Pull overlayfs fixes from Miklos Szeredi:
    "Fix a regression in 4.14 and one in 4.13. The latter is a case when
    Docker is doing something it really shouldn't and gets away with it.
    We now print a warning instead of erroring out.

    There are also fixes to several error paths"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: fix regression caused by exclusive upper/work dir protection
    ovl: fix missing unlock_rename() in ovl_do_copy_up()
    ovl: fix dentry leak in ovl_indexdir_cleanup()
    ovl: fix dput() of ERR_PTR in ovl_cleanup_index()
    ovl: fix error value printed in ovl_lookup_index()
    ovl: fix may_write_real() for overlayfs directories

    Linus Torvalds
     

05 Oct, 2017

7 commits

  • Enforcing exclusive ownership on upper/work dirs caused a docker
    regression: https://github.com/moby/moby/issues/34672.

    Euan spotted the regression and pointed to the offending commit.
    Vivek has brought the regression to my attention and provided this
    reproducer:

    Terminal 1:

    mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
    merged/

    Terminal 2:

    unshare -m

    Terminal 1:

    umount merged
    mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
    merged/
    mount: /root/overlay-testing/merged: none already mounted or mount point
    busy

    To fix the regression, I replaced the error with an alarming warning.
    With index feature enabled, mount does fail, but logs a suggestion to
    override exclusive dir protection by disabling index.
    Note that index=off mount does take the inuse locks, so a concurrent
    index=off will issue the warning and a concurrent index=on mount will fail.

    Documentation was updated to reflect this change.

    Fixes: 2cac0c00a6cd ("ovl: get exclusive ownership on upper/work dirs")
    Cc: # v4.13
    Reported-by: Euan Kemp
    Reported-by: Vivek Goyal
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • Use the ovl_lock_rename_workdir() helper which requires
    unlock_rename() only on lock success.

    Fixes: fd210b7d67ee ("ovl: move copy up lock out")
    Cc: # v4.13
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • index dentry was not released when breaking out of the loop
    due to index verification error.

    Fixes: 415543d5c64f ("ovl: cleanup bad and stale index entries on mount")
    Cc: # v4.13
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • Fixes: caf70cb2ba5d ("ovl: cleanup orphan index entries")
    Cc: # v4.13
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • Fixes: 359f392ca53e ("ovl: lookup index entry for copy up origin")
    Cc: # v4.13
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • Overlayfs directory file_inode() is the overlay inode whether the real
    inode is upper or lower.

    This fixes a regression in xfstest generic/158.

    Fixes: 7c6893e3c9ab ("ovl: don't allow writing ioctl on lower layer")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • Merge misc fixes from Andrew Morton:
    "A lot of stuff, sorry about that. A week on a beach, then a bunch of
    time catching up then more time letting it bake in -next. Shan't do
    that again!"

    * emailed patches from Andrew Morton: (51 commits)
    include/linux/fs.h: fix comment about struct address_space
    checkpatch: fix ignoring cover-letter logic
    m32r: fix build failure
    lib/ratelimit.c: use deferred printk() version
    kernel/params.c: improve STANDARD_PARAM_DEF readability
    kernel/params.c: fix an overflow in param_attr_show
    kernel/params.c: fix the maximum length in param_get_string
    mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long
    mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function
    kernel/kcmp.c: drop branch leftover typo
    memremap: add scheduling point to devm_memremap_pages
    mm, page_alloc: add scheduling point to memmap_init_zone
    mm, memory_hotplug: add scheduling point to __add_pages
    lib/idr.c: fix comment for idr_replace()
    mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
    kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv()
    include/linux/bitfield.h: remove 32bit from FIELD_GET comment block
    lib/lz4: make arrays static const, reduces object code size
    exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array
    exec: binfmt_misc: fix race between load_misc_binary() and kill_node()
    ...

    Linus Torvalds
     

04 Oct, 2017

12 commits

  • Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
    we should change the value.

    First, BTRFS_FS_EXCL_OP was set to 14.

    commit 171938e52807 ("btrfs: track exclusive filesystem operation in flags")

    Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.

    commit f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")

    As a result, both flags ended up with the value 14 by accident.
    This is solved by redefining BTRFS_FS_EXCL_OP as 16, which is safe
    because the flags are internal.

    Fixes: f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
    CC: stable@vger.kernel.org # 4.13+
    Signed-off-by: Tsutomu Itoh
    Reviewed-by: David Sterba
    [ minimize the change, update only BTRFS_FS_EXCL_OP ]
    Signed-off-by: David Sterba

    Tsutomu Itoh
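
The collision described in the btrfs flag commit above can be sketched in user space. This is only an illustration: the `demo_` names and helpers are hypothetical stand-ins for the kernel's bit operations, and only the bit numbers 14 and 16 come from the commit message.

```c
#include <stdbool.h>

/* Illustrative bit numbers: before the fix both names map to bit 14. */
enum {
    DEMO_FS_QUOTA_OVERRIDE = 14,
    DEMO_FS_EXCL_OP_OLD    = 14, /* collides with the quota flag */
    DEMO_FS_EXCL_OP_NEW    = 16, /* value chosen by the fix */
};

/* Hypothetical stand-in for setting one bit in a flags word. */
static void demo_set_bit(unsigned int nr, unsigned long *flags)
{
    *flags |= 1UL << nr;
}

/* Hypothetical stand-in for testing one bit in a flags word. */
static bool demo_test_bit(unsigned int nr, const unsigned long *flags)
{
    return (*flags >> nr) & 1UL;
}

/*
 * Returns true if setting the exclusive-operation flag makes the
 * quota-override flag appear set as well.
 */
bool demo_flags_alias(unsigned int excl_op_bit)
{
    unsigned long flags = 0;

    demo_set_bit(excl_op_bit, &flags);
    return demo_test_bit(DEMO_FS_QUOTA_OVERRIDE, &flags);
}
```

With the old bit number, starting an exclusive operation would also look like a quota override; with the new one the two flags are independent.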
     
  • Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
    btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
    introduces a regression on 32 bit machines.
    When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
    (2TB+) block devices and files) sector_t is 32 bit on 32bit machines.

    In the function submit_extent_page, 'sector' (which is sector_t type) is
    multiplied by 512 to convert it from sectors to bytes, leading to an
    overflow when the disk is bigger than 4GB (!).

    I added a cast to u64 to avoid overflow.

    Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
    CC: stable@vger.kernel.org # 4.13+
    Signed-off-by: Goffredo Baroncelli
    Tested-by: Jean-Denis Girard
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Goffredo Baroncelli
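
The overflow described in the btrfs commit above can be reproduced in user space with a 32-bit stand-in for sector_t. This is a sketch, not the kernel code: `demo_sector_t` and both function names are hypothetical, and only the multiplication by 512 and the cast to a 64-bit type are taken from the commit message.

```c
#include <stdint.h>

/* 32-bit stand-in for sector_t, as on 32-bit machines without CONFIG_LBDAF. */
typedef uint32_t demo_sector_t;

/* Pre-fix shape: the multiplication is evaluated in 32 bits and wraps. */
uint64_t demo_offset_wrapping(demo_sector_t sector)
{
    return (demo_sector_t)(sector * 512u);
}

/* Post-fix shape: widen to 64 bits first, then multiply. */
uint64_t demo_offset_widened(demo_sector_t sector)
{
    return (uint64_t)sector * 512u;
}
```

Sector 8388608 (2^23) is the first one whose byte offset, 4 GiB, no longer fits in 32 bits: the wrapping version yields 0 while the widened version yields 4294967296.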
     
  • security_inode_getsecurity() provides the text string value
    of a security attribute. It does not provide a "secctx".
    The code in xattr_getsecurity() that calls security_inode_getsecurity()
    and then calls security_release_secctx() happened to work because
    SELinux and Smack treat the attribute and the secctx the same way.
    It fails for cap_inode_getsecurity(), because that module has no
    secctx that ever needs releasing. It turns out that Smack is the
    one that's doing things wrong by not allocating memory when instructed
    to do so by the "alloc" parameter.

    The fix is simple enough. Change the security_release_secctx() to
    kfree() because it isn't a secctx being returned by
    security_inode_getsecurity(). Change Smack to allocate the string when
    told to do so.

    Note: this also fixes memory leaks for LSMs which implement
    inode_getsecurity but not release_secctx, such as capabilities.

    Signed-off-by: Casey Schaufler
    Reported-by: Konstantin Khlebnikov
    Cc: stable@vger.kernel.org
    Signed-off-by: James Morris

    Casey Schaufler
     
  • If we got two AIO writes into a COW area the second one might not have any
    COW extents left to convert. Handle that case gracefully instead of
    triggering an assert or accessing beyond the bounds of the extent list.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Since the CoW fork exists as a secondary data structure to the data
    fork, we must always swap cow forks during swapext. We also need to
    swap the extent counts and reset the cowblocks tags.

    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     
  • After the previous change "fmt" can't go away, so we can kill
    iname/iname_addr and use fmt->interpreter.

    Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • load_misc_binary() makes a local copy of fmt->interpreter under
    entries_lock to avoid the race with kill_node() but this is not enough;
    the whole Node can be freed after we drop entries_lock, not only the
    ->interpreter string.

    Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free
    this Node.

    Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc: Travis Gummels
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If MISC_FMT_OPEN_FILE flag is set e->interp_file must be valid or we
    have a bug which should not be silently ignored.

    Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • To ensure that load_misc_binary() can't use the partially destroyed
    Node, see also the next patch.

    The current logic looks wrong in any case, once we close interp_file it
    doesn't make any sense to delay kfree(inode->i_private), this Node is no
    longer valid. Even if the MISC_FMT_OPEN_FILE/interp_file checks were
    not racy (they are), load_misc_binary() should not try to reopen
    ->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL.

    And I can't understand why we use filp_close(), not fput().

    Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • kill_node() nullifies/checks Node->dentry to avoid double free. This
    complicates the next changes and this is very confusing:

    - we do not need to check dentry != NULL under entries_lock,
    kill_node() is always called under inode_lock(d_inode(root)) and we
    rely on this inode_lock() anyway, without this lock the
    MISC_FMT_OPEN_FILE cleanup could race with itself.

    - if kill_node() was already called and ->dentry == NULL we should not
    even try to close e->interp_file.

    We can change bm_entry_write() to simply check !list_empty(list) before
    kill_node. Again, we rely on inode_lock(), in particular it saves us
    from the race with bm_status_write(), another caller of kill_node().

    Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Ben Woodard
    Cc: James Bottomley
    Cc: Jim Foraker
    Cc:
    Cc: Travis Gummels
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Patch series "exec: binfmt_misc: fix use-after-free, kill
    iname[BINPRM_BUF_SIZE]".

    It looks like this code was always wrong, then commit 948b701a607f
    ("binfmt_misc: add persistent opened binary handler for containers")
    added more problems.

    This patch (of 6):

    load_script() can simply use i_name instead, it points into bprm->buf[]
    and nobody can change this memory until we call prepare_binprm().

    The only complication is that we need to also change the signature of
    bprm_change_interp() but this change looks good too.

    While at it, do whitespace/style cleanups.

    NOTE: the real motivation for this change is that people want to
    increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
    this looks more complicated because afaics it is very buggy.

    Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Travis Gummels
    Cc: Ben Woodard
    Cc: Jim Foraker
    Cc:
    Cc: Al Viro
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When reading the event from the uffd, we put it on a temporary
    fork_event list to detect if we can still access it after releasing and
    retaking the event_wqh.lock.

    If fork aborts and removes the event from the fork_event all is fine as
    long as we're still in the userfault read context and fork_event head is
    still alive.

    We have to move the event, which is allocated on the fork kernel stack,
    back from the fork_event list head to the event_wqh head before
    returning from userfaultfd_ctx_read, because the lifetime of the
    fork_event head is limited to the lifetime of the userfaultfd_ctx_read
    stack frame.

    If the event is not moved back to event_wqh, the
    __remove_wait_queue(&ctx->event_wqh, &ewq->wq); call in
    userfaultfd_event_wait_completion removes it from a list head that has
    already been freed along with the reader's stack.

    This could only happen if resolve_userfault_fork failed (for example if
    there are no file descriptors available to allocate the fork uffd). If
    it succeeded it was put back correctly.

    Furthermore, after find_userfault_evt receives a fork event, the forked
    userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
    be released by the fork thread as soon as the event_wqh.lock is
    released. Taking a reference on the fork_nctx before dropping the lock
    prevents a use-after-free in resolve_userfault_fork().

    If the fork side aborted and already released everything, we still
    try to let resolve_userfault_fork() succeed, if possible.

    Fixes: 893e26e61d04eac9 ("userfaultfd: non-cooperative: Add fork() event")
    Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Mark Rutland
    Tested-by: Mark Rutland
    Cc: Pavel Emelyanov
    Cc: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

02 Oct, 2017

3 commits

  • The previous commit 5d37ca14 "ceph: send LSSNAP request to auth mds
    of directory inode" is buggy. It makes __choose_mds() choose the mds
    based on the hash of the '.snap' dentry.

    Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • commit 3ae0bebc "ceph: queue cap snap only when snap realm's
    context changes" introduced a regression: we may not call
    queue_realm_cap_snaps() for newly created snap realm. This
    regression allows unflushed snapshot data to be overwritten.

    Link: http://tracker.ceph.com/issues/21483
    Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Yan, Zheng
     
  • Pull scheduler fixes from Thomas Gleixner:
    "The scheduler pull request comes with the following updates:

    - Prevent a divide by zero issue by validating the input value of
    sysctl_sched_time_avg

    - Make task state printing consistent all over the place and have
    explicit state characters for IDLE and PARKED so they won't be
    displayed as 'D' state which confuses tools"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/sysctl: Check user input value of sysctl_sched_time_avg
    sched/debug: Add explicit TASK_PARKED printing
    sched/debug: Ignore TASK_IDLE for SysRq-W
    sched/debug: Add explicit TASK_IDLE printing
    sched/tracing: Use common task-state helpers
    sched/tracing: Fix trace_sched_switch task-state printing
    sched/debug: Remove unused variable
    sched/debug: Convert TASK_state to hex
    sched/debug: Implement consistent task-state printing

    Linus Torvalds
     

30 Sep, 2017

1 commit

  • Pull btrfs fixes from David Sterba:
    "We've collected a bunch of isolated fixes, for crashes, user-visible
    behaviour or missing bits from other subsystem cleanups from the past.

    The overall number is not small but I was not able to make it
    significantly smaller. Most of the patches are supposed to go to
    stable"

    * 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: log csums for all modified extents
    Btrfs: fix unexpected result when dio reading corrupted blocks
    btrfs: Report error on removing qgroup if del_qgroup_item fails
    Btrfs: skip checksum when reading compressed data if some IO have failed
    Btrfs: fix kernel oops while reading compressed data
    Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
    Btrfs: do not backup tree roots when fsync
    btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
    btrfs: propagate error to btrfs_cmp_data_prepare caller
    btrfs: prevent to set invalid default subvolid
    Btrfs: send: fix error number for unknown inode types
    btrfs: fix NULL pointer dereference from free_reloc_roots()
    btrfs: finish ordered extent cleaning if no progress is found
    btrfs: clear ordered flag on cleaning up ordered extents
    Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
    Btrfs: do not reset bio->bi_ops while writing bio
    Btrfs: use the new helper wbc_to_write_flags

    Linus Torvalds
     

29 Sep, 2017

4 commits

  • Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE, give it
    its own print state because it will not in fact get woken by regular
    wakeups and is a long-term state.

    This requires moving TASK_PARKED into the TASK_REPORT mask, and since
    that latter needs to be a contiguous bitmask, we need to shuffle the
    bits around a bit.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
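
Why the report mask has to be contiguous can be seen from a small user-space model of a state-to-character lookup. The bit values, the `DEMO_` names and the character table below are illustrative assumptions, not the kernel's definitions; the sketch only shows the technique of indexing a character table by the position of the highest reportable bit.

```c
/* Illustrative bit values; the real definitions may differ. */
#define DEMO_RUNNING         0x00u
#define DEMO_INTERRUPTIBLE   0x01u
#define DEMO_UNINTERRUPTIBLE 0x02u
#define DEMO_STOPPED         0x04u
#define DEMO_TRACED          0x08u
#define DEMO_PARKED          0x10u

/* Contiguous mask covering every reportable bit above. */
#define DEMO_REPORT          0x1fu

/* Position of the highest set bit, 1-based; 0 when no bit is set. */
static unsigned int demo_fls(unsigned int x)
{
    unsigned int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Map a task state to a single character via the highest reportable bit. */
char demo_state_char(unsigned int state)
{
    /* One character per bit position, plus 'R' for "no bit set". */
    static const char chars[] = "RSDTtP";

    return chars[demo_fls(state & DEMO_REPORT)];
}
```

Because the bit position is used directly as the table index, a gap in the mask would leave a table slot with no meaning, which is why adding a state to the mask means shuffling the bits so they stay adjacent.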
     
  • Markus reported that kthreads that idle using TASK_IDLE instead of
    TASK_INTERRUPTIBLE are reported as TASK_UNINTERRUPTIBLE and things
    like htop mark those red.

    This is undesirable, so add an explicit state for TASK_IDLE.

    Reported-by: Markus Trippelsdorf
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently get_task_state() and task_state_to_char() report different
    states, create a number of common helpers and unify the reported state
    space.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull xfs fixes from Darrick Wong:

    - fix various problems with the copy-on-write extent maps getting freed
    at the wrong time

    - fix printk format specifier problems

    - report zeroing operation outcomes instead of dropping them on the
    floor

    - fix some crashes when dio operations partially fail

    - fix a race condition between unwritten extent conversion & dio read

    - fix some incorrect tests in the inode log item processing

    - correct the delayed allocation space reservations on rmap filesystems

    - fix some problems checking for dax support

    * tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: revert "xfs: factor rmap btree size into the indlen calculations"
    xfs: Capture state of the right inode in xfs_iflush_done
    xfs: perag initialization should only touch m_ag_max_usable for AG 0
    xfs: update i_size after unwritten conversion in dio completion
    iomap_dio_rw: Allocate AIO completion queue before submitting dio
    xfs: validate bdev support for DAX inode flag
    xfs: remove redundant re-initialization of total_nr_pages
    xfs: Output warning message when discard option was enabled even though the device does not support discard
    xfs: report zeroed or not correctly in xfs_zero_range()
    xfs: kill meaningless variable 'zero'
    fs/xfs: Use %pS printk format for direct addresses
    xfs: evict CoW fork extents when performing finsert/fcollapse
    xfs: don't unconditionally clear the reflink flag on zero-block files

    Linus Torvalds
     

28 Sep, 2017

1 commit


27 Sep, 2017

8 commits

  • Eric has reported that since commit d2faa415166b "quota: Do not acquire
    dqio_sem for dquot overwrites in v2 format" test generic/232
    occasionally fails due to quota information being incorrect. Indeed that
    commit was too eager to remove dqio_sem completely from the path that
    just overwrites a quota structure with updated information. Although
    that is innocent on its own, another process that inserts a new quota
    structure into the same block can perform a read-modify-write cycle of
    that block, thus effectively discarding the quota information update if
    the two race in the wrong way.

    Fix the problem by acquiring dqio_sem for reading for overwrites of
    quota structure. Note that it *is* possible to completely avoid taking
    dqio_sem in the overwrite path; however, that would require modifying
    the paths that insert and delete quota structures to avoid RMW cycles of
    the full block, and for now it is not clear whether it is worth the
    hassle.

    Fixes: d2faa415166b2883428efa92f451774ef44373ac
    Reported-and-tested-by: Eric Whitney
    Signed-off-by: Jan Kara

    Jan Kara
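
The lost update described in the quota commit above can be modelled deterministically, without threads or locks. The sketch below is an assumption-laden toy: a "block" with two slots stands in for a quota block, all `demo_` names are hypothetical, and the interleaving is spelled out by hand instead of being produced by a real race.

```c
#include <string.h>

#define DEMO_SLOTS 2

/* Toy stand-in for one on-disk block holding several quota entries. */
struct demo_block {
    int slot[DEMO_SLOTS];
};

/* Overwrite path: updates a single existing entry in place. */
static void demo_overwrite(struct demo_block *disk, int idx, int val)
{
    disk->slot[idx] = val;
}

/* Insert path, first half: read the whole block into a private copy. */
static void demo_insert_read(const struct demo_block *disk,
                             struct demo_block *copy)
{
    memcpy(copy, disk, sizeof(*copy));
}

/* Insert path, second half: modify the copy and write the whole block. */
static void demo_insert_write(struct demo_block *disk,
                              struct demo_block *copy, int idx, int val)
{
    copy->slot[idx] = val;
    memcpy(disk, copy, sizeof(*disk));
}

/*
 * Returns the final value of slot 0. With serialized == 0 the overwrite
 * lands between the inserter's read and write, as in the reported race.
 */
int demo_run(int serialized)
{
    struct demo_block disk = { { 1, 0 } };
    struct demo_block copy;

    if (serialized) {
        demo_overwrite(&disk, 0, 42);
        demo_insert_read(&disk, &copy);
        demo_insert_write(&disk, &copy, 1, 7);
    } else {
        demo_insert_read(&disk, &copy);
        demo_overwrite(&disk, 0, 42);          /* inside the RMW window */
        demo_insert_write(&disk, &copy, 1, 7); /* stale copy wins */
    }
    return disk.slot[0];
}
```

In the unserialized order the overwrite of slot 0 is discarded when the stale copy is written back; keeping the overwrite out of the read-modify-write window, which is what holding dqio_sem for reading achieves, preserves it.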
     
  • In generic_file_llseek_size, return -ENXIO for negative offsets as well
    as offsets beyond EOF. This affects filesystems which don't implement
    SEEK_HOLE / SEEK_DATA internally, possibly because they don't support
    holes.

    Fixes xfstest generic/448.

    Signed-off-by: Andreas Gruenbacher
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Andreas Gruenbacher
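
A user-space model of the default SEEK_DATA / SEEK_HOLE handling described in the commit above might look as follows. It is a sketch under stated assumptions: `demo_llseek` and `enum demo_whence` are hypothetical, the file is assumed to contain no holes, and an offset exactly at EOF is assumed to be rejected in the same way as one beyond it.

```c
#include <errno.h>
#include <stdint.h>

enum demo_whence { DEMO_SEEK_DATA, DEMO_SEEK_HOLE };

/*
 * Model of a filesystem without hole support: the whole file is data and
 * the only hole is the implicit one at EOF. Returns the resulting offset
 * or a negative errno value.
 */
int64_t demo_llseek(int64_t offset, enum demo_whence whence, int64_t eof)
{
    /*
     * Comparing as unsigned makes a negative offset look huge, so a single
     * test rejects both negative offsets and offsets at or past EOF.
     */
    if ((uint64_t)offset >= (uint64_t)eof)
        return -ENXIO;

    return whence == DEMO_SEEK_HOLE ? eof : offset;
}
```

For a 100-byte file, seeking for data at offset 10 returns 10, seeking for a hole at offset 10 returns 100, and offsets of -1 or 100 yield -ENXIO.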
     
  • In commit fd26a88093ba we added a worst case estimate for rmapbt blocks
    needed to satisfy the block mapping request. Since then, we added the
    ability to reserve enough space in each AG such that we should never run
    out of blocks to grow the rmapbt, which makes this calculation
    unnecessary. Revert the commit because it makes the extra delalloc
    indlen accounting unnecessary and incorrect.

    Reported-by: Eryu Guan
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     
  • My previous patch, d3a304b6292168b83b45d624784f973fdc1ca674, checks for
    the XFS_LI_FAILED flag in xfs_iflush_done, so the failed item can be
    properly resubmitted.

    In the loop scanning other inodes being completed, it should check the
    current item for XFS_LI_FAILED, and not the initial one.

    The state of the initial inode is checked after the loop ends.

    Kudos to Eric for catching this.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Carlos Maiolino
     
  • We call __xfs_ag_resv_init to make a per-AG reservation for each AG.
    This makes the reservation per-AG, not per-filesystem. Therefore, it
    is incorrect to adjust m_ag_max_usable for each AG. Adjust it only
    when we're reserving AG 0's blocks so that we only do it once per fs.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
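
The accounting mistake in the commit above comes down to adjusting one filesystem-wide value inside a per-AG loop. A toy model, with a hypothetical function name and made-up numbers, assuming every AG reserves the same number of blocks:

```c
/*
 * Returns the "max usable blocks per AG" value after reserving `resv`
 * blocks in each of `ag_count` allocation groups. With only_ag0 == 0 the
 * value is adjusted once per AG (pre-fix shape); otherwise it is adjusted
 * only while handling AG 0, so the adjustment happens once per filesystem.
 */
unsigned int demo_max_usable(unsigned int initial, unsigned int ag_count,
                             unsigned int resv, int only_ag0)
{
    unsigned int usable = initial;
    unsigned int agno;

    for (agno = 0; agno < ag_count; agno++) {
        if (!only_ag0 || agno == 0)
            usable -= resv;
    }
    return usable;
}
```

With 4 AGs, 1000 usable blocks and a 10-block reservation, the pre-fix shape ends at 960 while the fixed shape ends at 990.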
     
  • Since commit d531d91d6990 ("xfs: always use unwritten extents for
    direct I/O writes"), we start allocating unwritten extents for all
    direct writes to allow appending aio in XFS.

    But for dio writes that could extend file size we update the in-core
    inode size first, then convert the unwritten extents to real
    allocations at dio completion time in xfs_dio_write_end_io(). Thus a
    racing direct read could see the new i_size and find the unwritten
    extents first and read zeros instead of actual data, if the direct
    writer also takes a shared iolock.

    Fix it by updating the in-core inode size after the unwritten extent
    conversion. To do this, introduce a new boolean argument to
    xfs_iomap_write_unwritten() to tell if we want to update in-core
    i_size or not.

    Suggested-by: Brian Foster
    Reviewed-by: Brian Foster
    Signed-off-by: Eryu Guan
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eryu Guan
     
  • Executing xfs/104 test in a loop on Linux-v4.13 kernel on a ppc64
    machine can cause the following NULL pointer dereference,

    .queue_work_on+0x4c/0x80
    .iomap_dio_bio_end_io+0xbc/0x1f0
    .bio_endio+0x118/0x1f0
    .blk_update_request+0xd0/0x470
    .blk_mq_end_request+0x24/0xc0
    .lo_complete_rq+0x40/0xe0
    .__blk_mq_complete_request_remote+0x28/0x40
    .flush_smp_call_function_queue+0xc4/0x1e0
    .smp_ipi_demux_relaxed+0x8c/0x100
    .icp_hv_ipi_action+0x54/0xa0
    .__handle_irq_event_percpu+0x84/0x2c0
    .handle_irq_event_percpu+0x28/0x80
    .handle_percpu_irq+0x78/0xc0
    .generic_handle_irq+0x40/0x70
    .__do_irq+0x88/0x200
    .call_do_irq+0x14/0x24
    .do_IRQ+0x84/0x130

    This occurs due to the following sequence of events,

    1. Allocate dio for Direct I/O write.
    2. Invoke iomap_apply() until iov_iter_count() bytes have been submitted.
    - Assume that we have submitted at least one bio. Hence iomap_dio->ref value
    will be >= 2.
    - If during the second iteration, iomap_apply() ends up returning -ENOSPC, we would
    break out of the loop and since the 'ret' value is a negative number we
    end up not allocating memory for super_block->s_dio_done_wq.
    3. Meanwhile, iomap_dio_bio_end_io() is invoked for bios that have been
    submitted and here the code ends up dereferencing the NULL pointer stored
    at super_block->s_dio_done_wq.

    This commit fixes the bug by allocating memory for
    super_block->s_dio_done_wq before iomap_apply() is invoked.

    Reported-by: Eryu Guan
    Reviewed-by: Christoph Hellwig
    Tested-by: Eryu Guan
    Signed-off-by: Chandan Rajendra
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Chandan Rajendra
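
The ordering problem in the sequence of events above can be reduced to a small single-threaded model. Everything here is a hypothetical stand-in: `struct demo_sb` plays the role of the super block, `done_wq` the role of s_dio_done_wq, and the completion handler returns -1 where the real code would dereference a NULL pointer.

```c
#include <errno.h>
#include <stddef.h>

/* Toy super block; done_wq stands in for the dio completion queue. */
struct demo_sb {
    int *done_wq;
};

static int demo_wq_storage;

/* Set up the completion queue if it does not exist yet. */
static void demo_alloc_wq(struct demo_sb *sb)
{
    if (!sb->done_wq)
        sb->done_wq = &demo_wq_storage;
}

/* Completion handler: needs the queue; -1 marks the would-be oops. */
static int demo_bio_end_io(const struct demo_sb *sb)
{
    return sb->done_wq ? 0 : -1;
}

/*
 * Model of a direct write whose first chunk is submitted and whose second
 * chunk fails with -ENOSPC. Returns the result of the completion handler
 * that still runs for the first chunk.
 */
int demo_dio_write(int alloc_before_submit)
{
    struct demo_sb sb = { NULL };
    int ret;

    if (alloc_before_submit)
        demo_alloc_wq(&sb);      /* fixed ordering */

    ret = -ENOSPC;               /* submission loop ends with an error */

    /* Pre-fix shape: the queue is only set up after a successful submit. */
    if (!alloc_before_submit && ret >= 0)
        demo_alloc_wq(&sb);

    return demo_bio_end_io(&sb);
}
```

Allocating the queue before any I/O is submitted means the completion of an already submitted bio always finds it, regardless of how the submission loop ends.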
     
  • Currently only the blocksize is checked, but we should really be calling
    bdev_dax_supported() which also tests to make sure we can get a
    struct dax_device and that the dax_direct_access() path is working.

    This is the same check that we do for the "-o dax" mount option in
    xfs_fs_fill_super().

    This does not fix the race issues that caused the XFS DAX inode option to
    be disabled, so that option will still be disabled. If/when we re-enable
    it, though, I think we will want this issue to have been fixed. I also do
    think that we want to fix this in stable kernels.

    Signed-off-by: Ross Zwisler
    CC: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Ross Zwisler