05 Mar, 2020

1 commit

  • commit 953aa9d136f53e226448dbd801a905c28f8071bf upstream.

    Don't allow passing arbitrary flags as they change behavior including
    memory allocation that the call stack is not prepared for.

    Fixes: ddbca70cc45c ("xfs: allocate xattr buffer on demand")
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
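
    A minimal sketch of the kind of flags check described above (not the
    upstream diff; the mask and helper names here are illustrative
    assumptions):

    /* Hypothetical allow-list of attr flags accepted from userspace. */
    #define XFS_ATTR_FLAGS_ALLOWED  (ATTR_ROOT | ATTR_SECURE | \
                                     ATTR_CREATE | ATTR_REPLACE)

    static int validate_attr_flags(unsigned int flags)
    {
            /* Reject any bit the call stack is not prepared to handle. */
            if (flags & ~XFS_ATTR_FLAGS_ALLOWED)
                    return -EINVAL;
            return 0;
    }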
     

26 Jan, 2020

1 commit

  • commit 3dd4d40b420846dd35869ccc8f8627feef2cff32 upstream.

    Flags passed to Q_XQUOTARM were not sanity checked for invalid values.
    Fix that.

    Fixes: 9da93f9b7cdf ("xfs: fix Q_XQUOTARM ioctl")
    Reported-by: Yang Xu
    Signed-off-by: Jan Kara
    Reviewed-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

09 Jan, 2020

2 commits

  • [ Upstream commit 5d1116d4c6af3e580f1ed0382ca5a94bd65a34cf ]

    Christoph Hellwig complained about the following soft lockup warning
    when running scrub after generic/175 with preemption disabled and
    slub debugging enabled:

    watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [xfs_scrub:161]
    Modules linked in:
    irq event stamp: 41692326
    hardirqs last enabled at (41692325): [] _raw_0
    hardirqs last disabled at (41692326): [] trace0
    softirqs last enabled at (41684994): [] __do_e
    softirqs last disabled at (41684987): [] irq_e0
    CPU: 3 PID: 16189 Comm: xfs_scrub Not tainted 5.4.0-rc3+ #30
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.124
    RIP: 0010:_raw_spin_unlock_irqrestore+0x39/0x40
    Code: 89 f3 be 01 00 00 00 e8 d5 3a e5 fe 48 89 ef e8 ed 87 e5 f2
    RSP: 0018:ffffc9000233f970 EFLAGS: 00000286 ORIG_RAX: ffffffffff3
    RAX: ffff88813b398040 RBX: 0000000000000286 RCX: 0000000000000006
    RDX: 0000000000000006 RSI: ffff88813b3988c0 RDI: ffff88813b398040
    RBP: ffff888137958640 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00042b0c00
    R13: 0000000000000001 R14: ffff88810ac32308 R15: ffff8881376fc040
    FS: 00007f6113dea700(0000) GS:ffff88813bb80000(0000) knlGS:00000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f6113de8ff8 CR3: 000000012f290000 CR4: 00000000000006e0
    Call Trace:
    free_debug_processing+0x1dd/0x240
    __slab_free+0x231/0x410
    kmem_cache_free+0x30e/0x360
    xchk_ag_btcur_free+0x76/0xb0
    xchk_ag_free+0x10/0x80
    xchk_bmap_iextent_xref.isra.14+0xd9/0x120
    xchk_bmap_iextent+0x187/0x210
    xchk_bmap+0x2e0/0x3b0
    xfs_scrub_metadata+0x2e7/0x500
    xfs_ioc_scrub_metadata+0x4a/0xa0
    xfs_file_ioctl+0x58a/0xcd0
    do_vfs_ioctl+0xa0/0x6f0
    ksys_ioctl+0x5b/0x90
    __x64_sys_ioctl+0x11/0x20
    do_syscall_64+0x4b/0x1a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    If preemption is disabled, all metadata buffers needed to perform the
    scrub are already in memory, and there are a lot of records to check,
    then it's possible that the scrub thread will run for an extended period of
    time without sleeping for IO or any other reason. Then the watchdog
    timer or the RCU stall timeout can trigger, producing the backtrace
    above.

    To fix this problem, call cond_resched() from the scrub thread so that
    we back out to the scheduler whenever necessary.

    Reported-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Darrick J. Wong
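
    A minimal sketch of the pattern the fix applies, assuming a
    hypothetical record-checking helper; the point is simply to yield
    periodically inside a long, IO-free loop:

    static int check_all_records(struct scrub_rec *recs, int nr_recs)
    {
            int i, error = 0;

            for (i = 0; i < nr_recs; i++) {
                    error = check_one_record(&recs[i]);  /* hypothetical */
                    if (error)
                            break;
                    /* Yield so the soft-lockup watchdog and RCU stall
                     * detector never fire on non-preemptible kernels. */
                    cond_resched();
            }
            return error;
    }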
     
  • commit 69ffe5960df16938bccfe1b65382af0b3de51265 upstream.

    Commit 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") added
    a check in __xfs_bunmapi() to stop early if we would touch multiple AGs
    in the wrong order. However, this check isn't applicable for realtime
    files. In most cases, it just makes us do unnecessary commits. However,
    without the fix from the previous commit ("xfs: fix realtime file data
    space leak"), if the last and second-to-last extents also happen to have
    different "AG numbers", then the break actually causes __xfs_bunmapi()
    to return without making any progress, which sends
    xfs_itruncate_extents_flags() into an infinite loop.

    Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     

05 Jan, 2020

1 commit

  • commit 798a9cada4694ca8d970259f216cec47e675bfd5 upstream.

    syzbot (via KASAN) reports a use-after-free in the error path of
    xlog_alloc_log(). Specifically, the iclog freeing loop doesn't
    handle the case of a fully initialized ->l_iclog linked list.
    Instead, it assumes that the list is partially constructed and NULL
    terminated.

    This bug manifested because there was no possible error scenario
    after iclog list setup when the original code was added. Subsequent
    code and associated error conditions were added some time later,
    while the original error handling code was never updated. Fix up the
    error loop to terminate either on a NULL iclog or reaching the end
    of the list.

    Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
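
    A simplified sketch of an error-path free loop that handles both a
    partially built (NULL-terminated) and a fully built (circular) iclog
    list; details are condensed from the description above, not copied
    from the patch:

    iclog = log->l_iclog;
    for (i = 0; i < log->l_iclog_bufs; i++) {
            if (!iclog)                     /* partially constructed list */
                    break;
            prev_iclog = iclog->ic_next;
            kmem_free(iclog);
            iclog = prev_iclog;
            if (iclog == log->l_iclog)      /* fully built list wrapped */
                    break;
    }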
     

15 Oct, 2019

1 commit

  • 64-bit time is a signed quantity in the kernel, so the bulkstat
    structure should reflect that. Note that the structure size stays
    the same and that we have not yet published userspace headers for this
    new ioctl so there are no users to break.

    Fixes: 7035f9724f84 ("xfs: introduce new v5 bulkstat structure")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
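
    An abridged illustration of the point (hypothetical struct name, not
    the full v5 bulkstat layout): the seconds fields must be signed so
    that pre-1970 timestamps are representable.

    struct bulkstat_times_example {
            int64_t  bs_atime;              /* signed seconds */
            int64_t  bs_mtime;
            int64_t  bs_ctime;
            int64_t  bs_btime;              /* inode creation time */
            uint32_t bs_atime_nsec;         /* nanoseconds stay unsigned */
            uint32_t bs_mtime_nsec;
            uint32_t bs_ctime_nsec;
            uint32_t bs_btime_nsec;
    };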
     

09 Oct, 2019

3 commits

  • The callers of xfs_bmap_local_to_extents_empty() log the inode
    external to the function, yet this function is where the on-disk
    format value is updated. Push the inode logging down into the
    function itself to help prevent future mistakes.

    Note that internal bmap callers track the inode logging flags
    independently and thus may log the inode core twice due to this
    change. This is harmless, so leave this code around for consistency
    with the other attr fork conversion functions.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • xfs_attr_shortform_to_leaf() attempts to put the shortform fork back
    together after a failed attempt to convert from shortform to leaf
    format. While this code reallocates and copies back the shortform
    attr fork data, it never resets the inode format field back to local
    format. Further, now that the inode is properly logged after the
    initial switch from local format, any error that triggers the
    recovery code will eventually abort the transaction and shut down the
    fs. Therefore, remove the broken and unnecessary error handling
    code.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
    When a directory changes from shortform (sf) to block format, the sf
    format is copied to a temporary buffer, the inode format is modified,
    and the updated format is filled with the dentries from the temporary
    buffer. If the inode format is modified and the attempt to grow the
    inode fails (due to an I/O error, for example), it is possible to
    return an error while leaving the directory in an inconsistent state
    and with an otherwise clean transaction. This results in corruption
    of the associated directory and leads to xfs_dabuf_map() errors as
    subsequent lookups cannot accurately determine the format of the
    directory. This problem is reproduced occasionally by generic/475.

    The fundamental problem is that xfs_dir2_sf_to_block() changes the
    on-disk inode format without logging the inode. The inode is
    eventually logged by the bmapi layer in the common case, but error
    checking introduces the possibility of failing the high level
    request before this happens.

    Update both the dir2 and attr callers of
    xfs_bmap_local_to_extents_empty() to log the inode core, consistent
    with the bmap local-to-extent format change codepath.
    This ensures that any subsequent errors after the format has changed
    cause the transaction to abort.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

07 Oct, 2019

4 commits

  • Guarantee zeroed memory buffers for cases where potential memory
    leak to disk can occur. In these cases, kmem_alloc is used and
    doesn't zero the buffer, opening the possibility of information
    leakage to disk.

    Use existing infrastructure (xfs_buf_allocate_memory) to obtain
    an already zeroed buffer from kernel memory.

    This solution avoids the performance hit that a wholesale replacement
    of kmem_alloc with kmem_zalloc would incur.

    Signed-off-by: Bill O'Donnell
    [darrick: fix bitwise complaint about kmflag_mask]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Bill O'Donnell
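
    The difference being closed, sketched with the XFS kmem helpers of
    that era (the surrounding context is illustrative):

    /* kmem_alloc() returns uninitialized memory; if any byte is never
     * overwritten before the buffer is written out, stale kernel data
     * can leak to disk. */
    buf = kmem_alloc(size, KM_NOFS);

    /* A zeroed allocation closes the leak, at the cost of a memset on
     * every allocation if applied wholesale - hence the targeted fix. */
    buf = kmem_zalloc(size, KM_NOFS);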
     
    Remove the unused error variable and return the value directly,
    since the variable was never updated.

    Signed-off-by: Aliasgar Surti
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Aliasgar Surti
     
  • The flags arg is always passed as zero, so remove it.

    (xfs_buf_get_uncached takes flags to support XBF_NO_IOACCT for
    the sb, but that should never be relevant for xfs_get_aghdr_buf)

    Signed-off-by: Eric Sandeen
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     
  • To ensure that all blocks touched by the range [offset, offset + count)
    are allocated, we need to calculate the block count from the difference
    of the range end (rounded up) and the range start (rounded down).

    Before this patch, we just round up the byte count, which may lead to
    unaligned ranges not being fully allocated:

    $ touch test_file
    $ block_size=$(stat -fc '%S' test_file)
    $ fallocate -o $((block_size / 2)) -l $block_size test_file
    $ xfs_bmap test_file
    test_file:
    0: [0..7]: 1396264..1396271
    1: [8..15]: hole

    There should not be a hole there. Instead, the first two blocks should
    be fully allocated.

    With this patch applied, the result is something like this:

    $ touch test_file
    $ block_size=$(stat -fc '%S' test_file)
    $ fallocate -o $((block_size / 2)) -l $block_size test_file
    $ xfs_bmap test_file
    test_file:
    0: [0..15]: 11024..11039

    Signed-off-by: Max Reitz
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Max Reitz
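
    The arithmetic at the heart of the fix, as a standalone sketch (plain
    C rather than the XFS conversion macros): round the start down, round
    the end up, and allocate the difference.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t bsize = 4096;                /* filesystem block size */
            uint64_t offset = bsize / 2, count = bsize;

            /* Wrong: rounding up only the byte count covers one block. */
            uint64_t bad_blocks = (count + bsize - 1) / bsize;

            /* Right: round end up, round start down, take the difference. */
            uint64_t start_blk = offset / bsize;
            uint64_t end_blk = (offset + count + bsize - 1) / bsize;
            uint64_t good_blocks = end_blk - start_blk;

            printf("wrong=%llu right=%llu\n",
                   (unsigned long long)bad_blocks,
                   (unsigned long long)good_blocks);  /* wrong=1 right=2 */
            return 0;
    }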
     

27 Sep, 2019

3 commits

  • Pull xfs fixes from Darrick Wong:
    "There are a couple of bug fixes and some small code cleanups that came
    in recently:

    - Minor code cleanups

    - Fix a superblock logging error

    - Ensure that collapse range converts the data fork to extents format
    when necessary

    - Revert the ALLOC_USERDATA cleanup because it caused subtle behavior
    regressions"

    * tag 'xfs-5.4-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: avoid unused to_mp() function warning
    xfs: log proper length of superblock
    xfs: revert 1baa2800e62d ("xfs: remove the unused XFS_ALLOC_USERDATA flag")
    xfs: removed unneeded variable
    xfs: convert inode to extent format after extent merge due to shift

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:

    - almost all of the rest of -mm

    - various other subsystems

    Subsystems affected by this patch series:
    memcg, misc, core-kernel, lib, checkpatch, reiserfs, fat, fork,
    cpumask, kexec, uaccess, kconfig, kgdb, bug, ipc, lzo, kasan, madvise,
    cleanups, pagemap

    * emailed patches from Andrew Morton : (77 commits)
    arch/sparc/include/asm/pgtable_64.h: fix build
    mm: treewide: clarify pgtable_page_{ctor,dtor}() naming
    ntfs: remove (un)?likely() from IS_ERR() conditions
    IB/hfi1: remove unlikely() from IS_ERR*() condition
    xfs: remove unlikely() from WARN_ON() condition
    wimax/i2400m: remove unlikely() from WARN*() condition
    fs: remove unlikely() from WARN_ON() condition
    xen/events: remove unlikely() from WARN() condition
    checkpatch: check for nested (un)?likely() calls
    hexagon: drop empty and unused free_initrd_mem
    mm: factor out common parts between MADV_COLD and MADV_PAGEOUT
    mm: introduce MADV_PAGEOUT
    mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
    mm: introduce MADV_COLD
    mm: untag user pointers in mmap/munmap/mremap/brk
    vfio/type1: untag user pointers in vaddr_get_pfn
    tee/shm: untag user pointers in tee_shm_register
    media/v4l2-core: untag user pointers in videobuf_dma_contig_user_get
    drm/radeon: untag user pointers in radeon_gem_userptr_ioctl
    drm/amdgpu: untag user pointers
    ...

    Linus Torvalds
     
  • "unlikely(WARN_ON(x))" is excessive. WARN_ON() already uses unlikely()
    internally.

    Link: http://lkml.kernel.org/r/20190829165025.15750-7-efremov@linux.com
    Signed-off-by: Denis Efremov
    Reviewed-by: Darrick J. Wong
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Efremov
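
    The cleanup in a nutshell (condition and error code are placeholders);
    since WARN_ON() already wraps its argument in unlikely(), the outer
    annotation is redundant:

    /* before */
    if (unlikely(WARN_ON(cond)))
            return -EIO;

    /* after */
    if (WARN_ON(cond))
            return -EIO;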
     

26 Sep, 2019

1 commit

  • Pull iomap updates from Darrick Wong:
    "After last week's failed pull request attempt, I scuttled everything
    in the branch except for the directio endio api changes, which were
    trivial. Everything else will simply have to wait for the next cycle.

    Summary:

    - Report both io errors and short io results to the directio endio
    handler.

    - Allow directio callers to pass an ops structure to iomap_dio_rw"

    * tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    iomap: move the iomap_dio_rw ->end_io callback into a structure
    iomap: split size and error for iomap_dio_rw ->end_io

    Linus Torvalds
     

25 Sep, 2019

1 commit

  • to_mp() was first introduced with the following commit:
    'commit 801cc4e17a34c ("xfs: debug mode forced buffered write failure")'

    But the user of to_mp() was removed by the commit below:
    'commit f8c47250ba46e ("xfs: convert drop_writes to use the errortag
    mechanism")'

    So a kernel build with clang throws the following warning:

    fs/xfs/xfs_sysfs.c:72:1: warning: unused function 'to_mp' [-Wunused-function]
    to_mp(struct kobject *kobject)

    Hence to_mp() can safely be removed to get rid of the warning message.

    Signed-off-by: Austin Kim
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Austin Kim
     

24 Sep, 2019

4 commits

  • xfs_trans_log_buf takes first byte, last byte as args. In this
    case, it should be from 0 to sizeof() - 1.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
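
    The corrected call shape, with a placeholder structure name; the last
    two arguments are inclusive byte offsets, so a whole structure is
    logged as [0, sizeof() - 1]:

    /* Log the entire on-disk structure held in the buffer. */
    xfs_trans_log_buf(tp, bp, 0, sizeof(struct some_ondisk_hdr) - 1);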
     
  • Revert this commit, as it caused periodic regressions in xfs/173 w/
    1k blocks.

    [1] https://lore.kernel.org/lkml/20190919014602.GN15734@shao2-debian/

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
    Return the value directly instead of using a variable, as the
    variable was never updated.

    Signed-off-by: Aliasgar Surti
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Aliasgar Surti
     
  • The collapse range operation can merge extents if two newly adjacent
    extents are physically contiguous. If the extent count is reduced on
    a btree format inode, a change to extent format might be necessary.
    This format change currently occurs as a side effect of the file
    size update after extents have been shifted for the collapse. This
    codepath ultimately calls xfs_bunmapi(), which happens to check for
    and execute the format conversion even if there were no blocks
    removed from the mapping.

    While this ultimately puts the inode into the correct state, the
    fact that the format conversion occurs in a separate transaction from the
    change that called for it is a problem. If an extent shift
    transaction commits and the filesystem happens to crash before the
    format conversion, the inode fork is left in a corrupted state after
    log recovery. The inode fork verifier fails and xfs_repair
    ultimately nukes the inode. This problem was originally reproduced
    by generic/388.

    Similar to how the insert range extent split code handles extent to
    btree conversion, update the collapse range extent merge code to
    handle btree to extent format conversion in the same transaction
    that merges the extents. This ensures that the inode fork format
    remains consistent if the filesystem happens to crash in the middle
    of a collapse range operation that changes the inode fork format.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

20 Sep, 2019

3 commits

  • Add a new iomap_dio_ops structure that for now just contains the end_io
    handler. This avoids storing the function pointer in a mutable structure,
    which is a possible exploit vector for kernel code execution, and prepares
    for adding a submit_io handler that btrfs needs.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
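
    A rough sketch of the shape of such an ops structure and a const ops
    table (the exact field list and end_io signature are assumptions
    based on the description here and in the next entry, not a copy of
    the merged header):

    struct iomap_dio_ops {
            int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
                          unsigned int flags);
    };

    /* Callers pass a pointer to a const, static table rather than
     * storing a bare function pointer in a mutable structure. */
    static const struct iomap_dio_ops xfs_dio_write_ops = {
            .end_io = xfs_dio_write_end_io,
    };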
     
  • Modify the calling convention for the iomap_dio_rw ->end_io() callback.
    Rather than passing either dio->error or dio->size as the 'size' argument,
    instead pass both the dio->error and the dio->size value separately.

    In the instance that an error occurred during a write, we currently cannot
    determine whether any blocks have been allocated beyond the current EOF and
    data has subsequently been written to these blocks within the ->end_io()
    callback. As a result, we cannot judge whether we should take the truncate
    failed write path. Having both dio->error and dio->size will allow us to
    perform such checks within this callback.

    Signed-off-by: Matthew Bobrowski
    [hch: minor cleanups]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Matthew Wilcox (Oracle)

    Matthew Bobrowski
     
  • Pull y2038 vfs updates from Arnd Bergmann:
    "Add inode timestamp clamping.

    This series from Deepa Dinamani adds a per-superblock minimum/maximum
    timestamp limit for a file system, and clamps timestamps as they are
    written, to avoid random behavior from integer overflow as well as
    having different time stamps on disk vs in memory.

    At mount time, a warning is now printed for any file system that can
    represent current timestamps but not future timestamps more than 30
    years into the future, similar to the arbitrary 30 year limit that was
    added to settimeofday().

    This was picked as a compromise to warn users to migrate to other file
    systems (e.g. ext4 instead of ext3) when they need the file system to
    survive beyond 2038 (or similar limits in other file systems), but not
    get in the way of normal usage"

    * tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
    ext4: Reduce ext4 timestamp warnings
    isofs: Initialize filesystem timestamp ranges
    pstore: fs superblock limits
    fs: omfs: Initialize filesystem timestamp ranges
    fs: hpfs: Initialize filesystem timestamp ranges
    fs: ceph: Initialize filesystem timestamp ranges
    fs: sysv: Initialize filesystem timestamp ranges
    fs: affs: Initialize filesystem timestamp ranges
    fs: fat: Initialize filesystem timestamp ranges
    fs: cifs: Initialize filesystem timestamp ranges
    fs: nfs: Initialize filesystem timestamp ranges
    ext4: Initialize timestamps limits
    9p: Fill min and max timestamps in sb
    fs: Fill in max and min timestamps in superblock
    utimes: Clamp the timestamps before update
    mount: Add mount warning for impending timestamp expiry
    timestamp_truncate: Replace users of timespec64_trunc
    vfs: Add timestamp_truncate() api
    vfs: Add file timestamp range support

    Linus Torvalds
     

19 Sep, 2019

2 commits

  • Pull xfs updates from Darrick Wong:
    "For this cycle we have the usual pile of cleanups and bug fixes, some
    performance improvements for online metadata scrubbing, massive
    speedups in the directory entry creation code, some performance
    improvement in the file ACL lookup code, a fix for a logging stall
    during mount, and fixes for concurrency problems.

    It has survived a couple of weeks of xfstests runs and merges cleanly.

    Summary:

    - Remove KM_SLEEP/KM_NOSLEEP.

    - Ensure that memory buffers for IO are properly sector-aligned to
    avoid problems that the block layer doesn't check.

    - Make the bmap scrubber more efficient in its record checking.

    - Don't crash xfs_db when superblock inode geometry is corrupt.

    - Fix btree key helper functions.

    - Remove unneeded error returns for things that can't fail.

    - Fix buffer logging bugs in repair.

    - Clean up iterator return values.

    - Speed up directory entry creation.

    - Enable allocation of xattr value memory buffer during lookup.

    - Fix readahead racing with truncate/punch hole.

    - Other minor cleanups.

    - Fix one AGI/AGF deadlock with RENAME_WHITEOUT.

    - More BUG -> WARN whackamole.

    - Fix various problems with the log failing to advance under certain
    circumstances, which results in stalls during mount"

    * tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits)
    xfs: push the grant head when the log head moves forward
    xfs: push iclog state cleaning into xlog_state_clean_log
    xfs: factor iclog state processing out of xlog_state_do_callback()
    xfs: factor callbacks out of xlog_state_do_callback()
    xfs: factor debug code out of xlog_state_do_callback()
    xfs: prevent CIL push holdoff in log recovery
    xfs: fix missed wakeup on l_flush_wait
    xfs: push the AIL in xlog_grant_head_wake
    xfs: Use WARN_ON_ONCE for bailout mount-operation
    xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT
    xfs: define a flags field for the AG geometry ioctl structure
    xfs: add a xfs_valid_startblock helper
    xfs: remove the unused XFS_ALLOC_USERDATA flag
    xfs: cleanup xfs_fsb_to_db
    xfs: fix the dax supported check in xfs_ioctl_setattr_dax_invalidate
    xfs: Fix stale data exposure when readahead races with hole punch
    fs: Export generic_fadvise()
    mm: Handle MADV_WILLNEED through vfs_fadvise()
    xfs: allocate xattr buffer on demand
    xfs: consolidate attribute value copying
    ...

    Linus Torvalds
     
  • Pull vfs namei updates from Al Viro:
    "Pathwalk-related stuff"

    [ Audit-related cleanups, misc simplifications, and easier to follow
    nd->root refcounts - Linus ]

    * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    devpts_pty_kill(): don't bother with d_delete()
    infiniband: don't bother with d_delete()
    hypfs: don't bother with d_delete()
    fs/namei.c: keep track of nd->root refcount status
    fs/namei.c: new helper - legitimize_root()
    kill the last users of user_{path,lpath,path_dir}()
    namei.h: get the comments on LOOKUP_... in sync with reality
    kill LOOKUP_NO_EVAL, don't bother including namei.h from audit.h
    audit_inode(): switch to passing AUDIT_INODE_...
    filename_mountpoint(): make LOOKUP_NO_EVAL unconditional there
    filename_lookup(): audit_inode() argument is always 0

    Linus Torvalds
     

06 Sep, 2019

9 commits

  • When the log fills up, we can get into the state where the
    outstanding items in the CIL being committed and aggregated are
    larger than the range that the reservation grant head tail pushing
    will attempt to clean. This can result in the tail pushing range
    being trimmed back to the the log head (l_last_sync_lsn) and so
    may not actually move the push target at all.

    When the iclogs associated with the CIL commit finally land, the
    log head moves forward, and this removes the restriction on the AIL
    push target. However, if we already have transactions sleeping on
    the grant head, and there's nothing in the AIL still to flush from
    the current push target, then nothing will move the tail of the log
    and trigger a log reservation wakeup.

    Hence there is nothing that will trigger xlog_grant_push_ail()
    to recalculate the AIL push target and start pushing on the AIL
    again to write back the metadata objects that pin the tail of the
    log and hence free up space and allow the transaction reservations
    to be woken and make progress.

    Hence we need to push on the grant head when we move the log head
    forward, as this may be the only trigger we have that can move the
    AIL push target forwards in this situation.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • xlog_state_clean_log() is only called from one place, and it occurs
    when an iclog is transitioning back to ACTIVE. Prior to calling
    xlog_state_clean_log, the iclog we are processing has a hard coded
    state check to DIRTY so that xlog_state_clean_log() processes it
    correctly. We also have a hard-coded wakeup after
    xlog_state_clean_log() to ensure that log force waiters on that iclog
    are woken correctly.

    Both of these things are operations required to finish processing an
    iclog and return it to the ACTIVE state again, so they make little
    sense to be separated from the rest of the clean state transition
    code.

    Hence push these things inside xlog_state_clean_log(), document the
    behaviour and rename it xlog_state_clean_iclog() to indicate that
    it's being driven by an iclog state change and does the iclog state
    change work itself.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • The iclog IO completion state processing is somewhat complex, and
    because it's inside two nested loops it is highly indented and very
    hard to read. Factor it out, flatten the logic flow and clean up the
    comments so that it is much easier to see what the code is doing both
    in processing the individual iclogs and in the overall
    xlog_state_do_callback() operation.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Simplify the code flow by lifting the iclog callback work out of
    the main iclog iteration loop. This isolates the log juggling and
    callbacks from the iclog state change logic in the loop.

    Note that the loopdidcallbacks variable is not actually tracking
    whether callbacks are actually run - it is tracking whether the
    icloglock was dropped during the loop and so determines if we
    completed the entire iclog scan loop atomically. Hence we know for
    certain there are either no more ordered completions to run or
    that the next completion will run the remaining ordered iclog
    completions. Hence rename that variable appropriately for its
    function.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Start making this function readable by lifting the debug code into
    a conditional function.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
    generic/530 on a machine with enough RAM and a non-preemptible
    kernel can run the AGI processing phase of log recovery entirely out
    of cache. This means it never blocks on locks, never waits for IO
    and runs entirely through the unlinked lists until it either
    completes or blocks and hangs because it has run out of log space.

    It runs out of log space because the background CIL push is
    scheduled but never runs. queue_work() queues the CIL work on the
    current CPU that is busy, and the workqueue code will not run it on
    any other CPU. Hence if the unlinked list processing never yields
    the CPU voluntarily, the push work is delayed indefinitely. This
    results in the CIL aggregating changes until all the log space is
    consumed.

    When the log recovery processing eventually blocks, the CIL flushes,
    but because the last iclog isn't full it is never submitted for IO,
    so the CIL flush never completes and nothing ever moves the log
    head forwards, or indeed inserts anything into the tail of the log,
    and hence nothing is able to get the log moving again and recovery
    hangs.

    There are several problems here, but the two obvious ones from
    the trace are that:
    a) log recovery does not yield the CPU for over 4 seconds,
    b) binding CIL pushes to a single CPU is a really bad idea.

    This patch addresses just these two aspects of the problem, and is
    suitable for backporting to work around any issues in older kernels.
    The more fundamental problem of preventing the CIL from consuming
    more than 50% of the log without committing will take more invasive
    and complex work, so will be done as followup work.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
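
    One half of the fix described above is to stop binding the CIL push
    work to the submitting CPU. A hedged sketch of allocating the push
    workqueue unbound (the flags other than WQ_UNBOUND and the names are
    illustrative):

    /* An unbound workqueue lets the CIL push run on any CPU, so a
     * recovery thread that never sleeps cannot starve it. */
    cil_wq = alloc_workqueue("xfs-cil/%s",
                             WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND,
                             0, fsname);

    The other half, yielding the CPU periodically during unlinked list
    processing, follows the same cond_resched() pattern sketched earlier
    in this log.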
     
  • The code in xlog_wait uses the spinlock to make adding the task to
    the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
    with respect to the waker.

    Doing the wakeup after releasing the spinlock opens up the following
    race condition:

    Task 1                                task 2
    add task to wait queue
                                          wake up task
    set task state to UNINTERRUPTIBLE

    This issue was found through code inspection as a result of kworkers
    being observed stuck in UNINTERRUPTIBLE state with an empty
    wait queue. It is rare and largely unreproducible.

    Simply moving the spin_unlock to after the wake_up_all results
    in the waker not being able to see a task on the waitqueue before
    it has set its state to UNINTERRUPTIBLE.

    This bug dates back to the conversion of this code to generic
    waitqueue infrastructure from a counting semaphore back in 2008
    which didn't place the wakeups consistently w.r.t. to the relevant
    spin locks.

    [dchinner: Also fix a similar issue in the shutdown path on
    xc_commit_wait. Update commit log with more details of the issue.]

    Fixes: d748c62367eb ("[XFS] Convert l_flushsema to a sv_t")
    Reported-by: Chris Mason
    Signed-off-by: Rik van Riel
    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Rik van Riel
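
    The pattern being fixed, sketched generically (lock and waitqueue
    names are illustrative, not the xlog code); the wakeup must happen
    before the lock protecting the waitqueue is dropped:

    /* racy: the wakeup can slip in between the sleeper adding itself to
     * the queue and setting TASK_UNINTERRUPTIBLE, so the wakeup is lost */
    spin_unlock(&log_lock);
    wake_up_all(&flush_wait);

    /* fixed: issue the wakeup while still holding the lock that the
     * sleeper holds across add_wait_queue() + __set_current_state() */
    wake_up_all(&flush_wait);
    spin_unlock(&log_lock);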
     
  • In the situation where the log is full and the CIL has not recently
    flushed, the AIL push threshold is throttled back to where the
    last write of the head of the log completed. This is stored in
    log->l_last_sync_lsn. Hence if the CIL holds > 25% of the log space
    pinned by flushes and/or aggregation in progress, we can get the
    situation where the head of the log lags a long way behind the
    reservation grant head.

    When this happens, the AIL push target is trimmed back from where
    the reservation grant head wants to push the log tail to, back to
    where the head of the log currently is. This means the push target
    doesn't reach far enough into the log to actually move the tail
    before the transaction reservation goes to sleep.

    When the CIL push completes, it moves the log head forward such that
    the AIL push target can now be moved, but that has no mechanism for
    pushing the log tail. Further, if the next tail movement of the log
    is not large enough to wake the waiter (i.e. still not enough space for
    it to have a reservation granted), we don't wake anything up, and
    hence we do not update the AIL push target to take into account the
    head of the log moving and allowing the push target to be moved
    forwards.

    To avoid this particular condition, if we fail to wake the first
    waiter on the grant head because we don't have enough space,
    push on the AIL again. This will pick up any movement of the log
    head and allow the push target to move forward due to completion of
    CIL pushing.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
    If CONFIG_BUG is enabled, BUG() is executed and the system crashes,
    so the bailout path for the mount never gets to run.

    Using WARN_ON_ONCE rather than BUG prevents this situation.

    Signed-off-by: Austin Kim
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Austin Kim
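
    The shape of the change (the condition and error code shown are
    placeholders): replace the fatal assertion with a warning so the
    mount error path can still run.

    /* before: on CONFIG_BUG kernels this crashes rather than failing
     * the mount */
    if (bad_condition)
            BUG();

    /* after: warn once and let the caller bail out of the mount cleanly */
    if (WARN_ON_ONCE(bad_condition))
            return -EFSCORRUPTED;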
     

04 Sep, 2019

2 commits

    When performing a rename operation with the RENAME_WHITEOUT flag, we
    first hold the AGF lock to allocate or free extents while manipulating
    the dirents, and then make the xfs_iunlink_remove() call last, holding
    the AGI lock to modify the tmpfile info, so we take the locks in the
    order AGF->AGI.

    The big problem here is that we have an ordering constraint on AGF
    and AGI locking - inode allocation locks the AGI, then can allocate
    a new extent for new inodes, locking the AGF after the AGI. Hence
    the ordering that is imposed by other parts of the code is AGI before
    AGF. So we get an ABBA deadlock between the AGI and AGF here.

    Process A:
    Call trace:
    ? __schedule+0x2bd/0x620
    schedule+0x33/0x90
    schedule_timeout+0x17d/0x290
    __down_common+0xef/0x125
    ? xfs_buf_find+0x215/0x6c0 [xfs]
    down+0x3b/0x50
    xfs_buf_lock+0x34/0xf0 [xfs]
    xfs_buf_find+0x215/0x6c0 [xfs]
    xfs_buf_get_map+0x37/0x230 [xfs]
    xfs_buf_read_map+0x29/0x190 [xfs]
    xfs_trans_read_buf_map+0x13d/0x520 [xfs]
    xfs_read_agf+0xa6/0x180 [xfs]
    ? schedule_timeout+0x17d/0x290
    xfs_alloc_read_agf+0x52/0x1f0 [xfs]
    xfs_alloc_fix_freelist+0x432/0x590 [xfs]
    ? down+0x3b/0x50
    ? xfs_buf_lock+0x34/0xf0 [xfs]
    ? xfs_buf_find+0x215/0x6c0 [xfs]
    xfs_alloc_vextent+0x301/0x6c0 [xfs]
    xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
    ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
    xfs_dialloc+0x116/0x290 [xfs]
    xfs_ialloc+0x6d/0x5e0 [xfs]
    ? xfs_log_reserve+0x165/0x280 [xfs]
    xfs_dir_ialloc+0x8c/0x240 [xfs]
    xfs_create+0x35a/0x610 [xfs]
    xfs_generic_create+0x1f1/0x2f0 [xfs]
    ...

    Process B:
    Call trace:
    ? __schedule+0x2bd/0x620
    ? xfs_bmapi_allocate+0x245/0x380 [xfs]
    schedule+0x33/0x90
    schedule_timeout+0x17d/0x290
    ? xfs_buf_find+0x1fd/0x6c0 [xfs]
    __down_common+0xef/0x125
    ? xfs_buf_get_map+0x37/0x230 [xfs]
    ? xfs_buf_find+0x215/0x6c0 [xfs]
    down+0x3b/0x50
    xfs_buf_lock+0x34/0xf0 [xfs]
    xfs_buf_find+0x215/0x6c0 [xfs]
    xfs_buf_get_map+0x37/0x230 [xfs]
    xfs_buf_read_map+0x29/0x190 [xfs]
    xfs_trans_read_buf_map+0x13d/0x520 [xfs]
    xfs_read_agi+0xa8/0x160 [xfs]
    xfs_iunlink_remove+0x6f/0x2a0 [xfs]
    ? current_time+0x46/0x80
    ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
    xfs_rename+0x57a/0xae0 [xfs]
    xfs_vn_rename+0xe4/0x150 [xfs]
    ...

    In this patch we move the xfs_iunlink_remove() call to
    before acquiring the AGF lock to preserve correct AGI/AGF locking
    order.

    Signed-off-by: kaixuxia
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    kaixuxia
     
  • Define a flags field for the AG geometry ioctl structure.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     

03 Sep, 2019

1 commit


31 Aug, 2019

1 commit