Eric Lee / smarc-fsl-linux-kernel

07 Oct, 2017

3 commits

eab26ad19 Merge tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

- fix a race between overlapping copy on write aio

- fix cow fork swapping when we defragment reflinked files

* tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: handle racy AIO in xfs_reflink_end_cow
xfs: always swap the cow forks when swapping extents

Linus Torvalds
2017-10-07 06:53:36 +0800
bf2db0b9f Merge branch 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"Two more fixes for bugs introduced in 4.13.

The sector_t problem with 32bit architecture and !LBDAF config seems
serious but the number of affected deployments is hopefully low.

The clashing status bits could lead to a confusing in-memory state of
the whole-filesystem operations if used with the quota override sysfs
knob"

* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix overlap of fs_info::flags values
btrfs: avoid overflow when sector_t is 32 bit

Linus Torvalds
2017-10-07 00:03:08 +0800
b77779b93 Merge tag 'ceph-for-4.14-rc4' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
"Two fixups for CephFS snapshot-handling patches in -rc1"

* tag 'ceph-for-4.14-rc4' of git://github.com/ceph/ceph-client:
ceph: fix __choose_mds() for LSSNAP request
ceph: properly queue cap snap for newly created snap realm

Linus Torvalds
2017-10-07 00:01:45 +0800

06 Oct, 2017

1 commit

8d4ef4e15 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
"Fix a regression in 4.14 and one in 4.13. The latter is a case when
Docker is doing something it really shouldn't and gets away with it.
We now print a warning instead of erroring out.

There are also fixes to several error paths"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix regression caused by exclusive upper/work dir protection
ovl: fix missing unlock_rename() in ovl_do_copy_up()
ovl: fix dentry leak in ovl_indexdir_cleanup()
ovl: fix dput() of ERR_PTR in ovl_cleanup_index()
ovl: fix error value printed in ovl_lookup_index()
ovl: fix may_write_real() for overlayfs directories

Linus Torvalds
2017-10-06 23:52:53 +0800

05 Oct, 2017

7 commits

85fdee1ee ovl: fix regression caused by exclusive upper/work dir protection

Enforcing exclusive ownership on upper/work dirs caused a docker
regression: https://github.com/moby/moby/issues/34672.

Euan spotted the regression and pointed to the offending commit.
Vivek has brought the regression to my attention and provided this
reproducer:

Terminal 1:

mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/

Terminal 2:

unshare -m

Terminal 1:

umount merged
mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
merged/
mount: /root/overlay-testing/merged: none already mounted or mount point
busy

To fix the regression, I replaced the error with an alarming warning.
With index feature enabled, mount does fail, but logs a suggestion to
override exclusive dir protection by disabling index.
Note that index=off mount does take the inuse locks, so a concurrent
index=off will issue the warning and a concurrent index=on mount will fail.

Documentation was updated to reflect this change.

Fixes: 2cac0c00a6cd ("ovl: get exclusive ownership on upper/work dirs")
Cc: # v4.13
Reported-by: Euan Kemp
Reported-by: Vivek Goyal
Signed-off-by: Amir Goldstein
Signed-off-by: Miklos Szeredi

Amir Goldstein
2017-10-05 21:53:18 +0800
5820dc088 ovl: fix missing unlock_rename() in ovl_do_copy_up()

Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.

Fixes: fd210b7d67ee ("ovl: move copy up lock out")
Cc: # v4.13
Signed-off-by: Amir Goldstein
Signed-off-by: Miklos Szeredi

Amir Goldstein
2017-10-05 21:53:18 +0800
dc7ab6773 ovl: fix dentry leak in ovl_indexdir_cleanup()

The index dentry was not released when breaking out of the loop
due to an index verification error.

Fixes: 415543d5c64f ("ovl: cleanup bad and stale index entries on mount")
Cc: # v4.13
Signed-off-by: Amir Goldstein
Signed-off-by: Miklos Szeredi

Amir Goldstein
2017-10-05 21:53:18 +0800
9f4ec904d ovl: fix dput() of ERR_PTR in ovl_cleanup_index()

Fixes: caf70cb2ba5d ("ovl: cleanup orphan index entries")
Cc: # v4.13
Signed-off-by: Amir Goldstein
Signed-off-by: Miklos Szeredi

Amir Goldstein
2017-10-05 21:53:18 +0800
e0082a0f0 ovl: fix error value printed in ovl_lookup_index()

Fixes: 359f392ca53e ("ovl: lookup index entry for copy up origin")
Cc: # v4.13
Signed-off-by: Amir Goldstein
Signed-off-by: Miklos Szeredi

Amir Goldstein
2017-10-05 21:53:18 +0800
954c736f8 ovl: fix may_write_real() for overlayfs directories

Overlayfs directory file_inode() is the overlay inode whether the real
inode is upper or lower.

This fixes a regression in xfstest generic/158.

Fixes: 7c6893e3c9ab ("ovl: don't allow writing ioctl on lower layer")
Signed-off-by: Amir Goldstein
Signed-off-by: Miklos Szeredi

Amir Goldstein
2017-10-05 21:53:18 +0800
b7e141644 Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"A lot of stuff, sorry about that. A week on a beach, then a bunch of
time catching up then more time letting it bake in -next. Shan't do
that again!"

* emailed patches from Andrew Morton: (51 commits)
include/linux/fs.h: fix comment about struct address_space
checkpatch: fix ignoring cover-letter logic
m32r: fix build failure
lib/ratelimit.c: use deferred printk() version
kernel/params.c: improve STANDARD_PARAM_DEF readability
kernel/params.c: fix an overflow in param_attr_show
kernel/params.c: fix the maximum length in param_get_string
mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long
mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function
kernel/kcmp.c: drop branch leftover typo
memremap: add scheduling point to devm_memremap_pages
mm, page_alloc: add scheduling point to memmap_init_zone
mm, memory_hotplug: add scheduling point to __add_pages
lib/idr.c: fix comment for idr_replace()
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv()
include/linux/bitfield.h: remove 32bit from FIELD_GET comment block
lib/lz4: make arrays static const, reduces object code size
exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array
exec: binfmt_misc: fix race between load_misc_binary() and kill_node()
...

Linus Torvalds
2017-10-05 00:30:50 +0800

04 Oct, 2017

12 commits

69ad59767 Btrfs: fix overlap of fs_info::flags values

Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.

First, BTRFS_FS_EXCL_OP was set to 14.

commit 171938e52807 ("btrfs: track exclusive filesystem operation in flags")

Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.

commit f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")

As a result, the value 14 was used for both flags by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
which is possible because the flags are internal.

Fixes: f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Tsutomu Itoh
Reviewed-by: David Sterba
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: David Sterba

Tsutomu Itoh
2017-10-04 22:44:18 +0800
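
A minimal sketch of why the shared bit number in 69ad59767 mattered, written as a small Python model rather than kernel code. The helpers `set_bit` and `bit_is_set` are hypothetical stand-ins for the kernel bit operations; only the bit numbers 14 and 16 come from the commit message.

```python
# Model of a flags word indexed by bit number (illustration only, not kernel code).

def set_bit(nr: int, flags: int) -> int:
    """Return flags with bit number nr set."""
    return flags | (1 << nr)


def bit_is_set(nr: int, flags: int) -> bool:
    """Report whether bit number nr is set in flags."""
    return bool(flags & (1 << nr))


BTRFS_FS_QUOTA_OVERRIDE = 14   # value quoted in the commit message
OLD_BTRFS_FS_EXCL_OP = 14      # clashing value before the fix
NEW_BTRFS_FS_EXCL_OP = 16      # value after the fix

# Setting the quota override flag is the only operation performed here.
flags = set_bit(BTRFS_FS_QUOTA_OVERRIDE, 0)

# Before the fix the exclusive-operation flag aliases the quota flag.
clash_before = bit_is_set(OLD_BTRFS_FS_EXCL_OP, flags)

# After the fix the two flags occupy different bits.
clash_after = bit_is_set(NEW_BTRFS_FS_EXCL_OP, flags)
```

With the old numbering, `clash_before` is true even though no exclusive operation was started, which is the confusing in-memory state the pull request text mentions.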
2d8ce70a0 btrfs: avoid overflow when sector_t is 32 bit

Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.

In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).

I added a cast to u64 to avoid overflow.

Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli
Tested-by: Jean-Denis Girard
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Goffredo Baroncelli
2017-10-04 22:22:56 +0800
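
The arithmetic behind 2d8ce70a0 can be sketched as follows. This is a Python model that imitates fixed-width multiplication with masks; the function names are hypothetical and the kernel code itself is C.

```python
# Imitate 32-bit and 64-bit unsigned multiplication with explicit masks.
U32_MASK = 0xFFFFFFFF
U64_MASK = 0xFFFFFFFFFFFFFFFF
SECTOR_SIZE = 512


def byte_offset_32bit(sector: int) -> int:
    """Multiply in 32-bit arithmetic, as happens when sector_t is 32 bits wide."""
    return (sector * SECTOR_SIZE) & U32_MASK


def byte_offset_64bit(sector: int) -> int:
    """Widen to 64 bits before multiplying, which is the effect of a u64 cast."""
    return (sector * SECTOR_SIZE) & U64_MASK


# First sector located exactly at the 4 GiB boundary.
sector_at_4gib = (4 * 1024**3) // SECTOR_SIZE

wrapped = byte_offset_32bit(sector_at_4gib)   # product no longer fits in 32 bits
correct = byte_offset_64bit(sector_at_4gib)   # product is preserved
```

Any sector at or beyond the 4 GiB boundary produces a truncated byte offset in the 32-bit variant, which matches the "bigger than 4GB" remark in the commit message.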
57e7ba04d lsm: fix smack_inode_removexattr and xattr_getsecurity memleak

security_inode_getsecurity() provides the text string value
of a security attribute. It does not provide a "secctx".
The code in xattr_getsecurity() that calls security_inode_getsecurity()
and then calls security_release_secctx() happened to work because
SElinux and Smack treat the attribute and the secctx the same way.
It fails for cap_inode_getsecurity(), because that module has no
secctx that ever needs releasing. It turns out that Smack is the
one that's doing things wrong by not allocating memory when instructed
to do so by the "alloc" parameter.

The fix is simple enough. Change the security_release_secctx() to
kfree() because it isn't a secctx being returned by
security_inode_getsecurity(). Change Smack to allocate the string when
told to do so.

Note: this also fixes memory leaks for LSMs which implement
inode_getsecurity but not release_secctx, such as capabilities.

Signed-off-by: Casey Schaufler
Reported-by: Konstantin Khlebnikov
Cc: stable@vger.kernel.org
Signed-off-by: James Morris

Casey Schaufler
2017-10-04 15:03:15 +0800
e12199f85 xfs: handle racy AIO in xfs_reflink_end_cow

If we got two AIO writes into a COW area the second one might not have any
COW extents left to convert. Handle that case gracefully instead of
triggering an assert or accessing beyond the bounds of the extent list.

Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2017-10-04 12:27:55 +0800
52bfcdd7a xfs: always swap the cow forks when swapping extents

Since the CoW fork exists as a secondary data structure to the data
fork, we must always swap cow forks during swapext. We also need to
swap the extent counts and reset the cowblocks tags.

Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Darrick J. Wong

Darrick J. Wong
2017-10-04 12:27:55 +0800
50097f749 exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array

After the previous change "fmt" can't go away, we can kill
iname/iname_addr and use fmt->interpreter.

Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc:
Cc: Travis Gummels
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
43a4f2619 exec: binfmt_misc: fix race between load_misc_binary() and kill_node()

load_misc_binary() makes a local copy of fmt->interpreter under
entries_lock to avoid the race with kill_node() but this is not enough;
the whole Node can be freed after we drop entries_lock, not only the
->interpreter string.

Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free
this Node.

Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc: Travis Gummels
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
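
The pinning idea in 43a4f2619 can be illustrated with a toy reference count. This is only a sketch: the `Node` class, its methods, and the interpreter path are hypothetical, and the kernel pins the dentry with dget()/dput() rather than counting references on the Node itself.

```python
class Node:
    """Hypothetical stand-in for a binfmt_misc entry; not the kernel structure."""

    def __init__(self, interpreter: str) -> None:
        self.interpreter = interpreter
        self.refcount = 1          # reference owned by the registry
        self.freed = False

    def get(self) -> None:
        """Pin the node (the role dget() plays in the commit)."""
        self.refcount += 1

    def put(self) -> None:
        """Drop a reference and free the node when the last one goes away."""
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True
            self.interpreter = None


node = Node("/usr/bin/example-interp")   # hypothetical interpreter path

node.get()                        # loader pins the node while the lock is held
node.put()                        # concurrent removal drops the registry reference
alive_while_pinned = not node.freed
interpreter_seen = node.interpreter   # still valid: the loader holds a reference

node.put()                        # loader is done, last reference frees the node
freed_after_unpin = node.freed
```

Copying only the interpreter string, as the old code did, protects the string but not the rest of the entry; holding a reference keeps the whole object alive until the loader finishes.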
eb23aa031 exec: binfmt_misc: remove the confusing e->interp_file != NULL checks

If MISC_FMT_OPEN_FILE flag is set e->interp_file must be valid or we
have a bug which should not be silently ignored.

Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc:
Cc: Travis Gummels
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
83f918274 exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()

To ensure that load_misc_binary() can't use the partially destroyed
Node, see also the next patch.

The current logic looks wrong in any case, once we close interp_file it
doesn't make any sense to delay kfree(inode->i_private), this Node is no
longer valid. Even if the MISC_FMT_OPEN_FILE/interp_file checks were
not racy (they are), load_misc_binary() should not try to reopen
->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL.

And I can't understand why we use filp_close() rather than fput().

Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc:
Cc: Travis Gummels
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
baba1b297 exec: binfmt_misc: don't nullify Node->dentry in kill_node()

kill_node() nullifies/checks Node->dentry to avoid double free. This
complicates the next changes and this is very confusing:

- we do not need to check dentry != NULL under entries_lock,
kill_node() is always called under inode_lock(d_inode(root)) and we
rely on this inode_lock() anyway, without this lock the
MISC_FMT_OPEN_FILE cleanup could race with itself.

- if kill_node() was already called and ->dentry == NULL we should not
even try to close e->interp_file.

We can change bm_entry_write() to simply check !list_empty(list) before
kill_node. Again, we rely on inode_lock(), in particular it saves us
from the race with bm_status_write(), another caller of kill_node().

Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Al Viro
Cc: Ben Woodard
Cc: James Bottomley
Cc: Jim Foraker
Cc:
Cc: Travis Gummels
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
c2315c187 exec: load_script: kill the onstack interp[BINPRM_BUF_SIZE] array

Patch series "exec: binfmt_misc: fix use-after-free, kill
iname[BINPRM_BUF_SIZE]".

It looks like this code was always wrong, then commit 948b701a607f
("binfmt_misc: add persistent opened binary handler for containers")
added more problems.

This patch (of 6):

load_script() can simply use i_name instead, it points into bprm->buf[]
and nobody can change this memory until we call prepare_binprm().

The only complication is that we need to also change the signature of
bprm_change_interp() but this change looks good too.

While at it, do whitespace/style cleanups.

NOTE: the real motivation for this change is that people want to
increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
this looks more complicated because afaics it is very buggy.

Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: Kees Cook
Cc: Travis Gummels
Cc: Ben Woodard
Cc: Jim Foraker
Cc:
Cc: Al Viro
Cc: James Bottomley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-10-04 08:54:25 +0800
384632e67 userfaultfd: non-cooperative: fix fork use after free

When reading the event from the uffd, we put it on a temporary
fork_event list to detect if we can still access it after releasing and
retaking the event_wqh.lock.

If fork aborts and removes the event from the fork_event all is fine as
long as we're still in the userfault read context and fork_event head is
still alive.

We have to move the event, which is allocated on the fork kernel stack, back
from the fork_event list head to the event_wqh head before returning from
userfaultfd_ctx_read, because the lifetime of the fork_event head is limited
to the userfaultfd_ctx_read stack lifetime.

Forgetting to move the event back to its event_wqh place causes
__remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
userfaultfd_event_wait_completion to remove it from a head that has
already been freed from the reader stack.

This could only happen if resolve_userfault_fork failed (for example if
there are no file descriptors available to allocate the fork uffd). If
it succeeded it was put back correctly.

Furthermore, after find_userfault_evt receives a fork event, the forked
userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
be released by the fork thread as soon as the event_wqh.lock is
released. Taking a reference on the fork_nctx before dropping the lock
prevents a use-after-free in resolve_userfault_fork().

If the fork side aborted and it already released everything, we still
try to let resolve_userfault_fork() succeed, if possible.

Fixes: 893e26e61d04eac9 ("userfaultfd: non-cooperative: Add fork() event")
Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Reported-by: Mark Rutland
Tested-by: Mark Rutland
Cc: Pavel Emelyanov
Cc: Mike Rapoport
Cc: "Dr. David Alan Gilbert"
Cc: Mike Kravetz
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2017-10-04 08:54:25 +0800

02 Oct, 2017

3 commits

38f340ccd ceph: fix __choose_mds() for LSSNAP request

The previous commit 5d37ca14 "ceph: send LSSNAP request to auth mds
of directory inode" is buggy. It makes __choose_mds() choose the mds
based on the hash of the '.snap' dentry.

Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov

Yan, Zheng
2017-10-02 22:18:16 +0800
9f4057fc9 ceph: properly queue cap snap for newly created snap realm

commit 3ae0bebc "ceph: queue cap snap only when snap realm's
context changes" introduced a regression: we may not call
queue_realm_cap_snaps() for newly created snap realm. This
regression allows unflushed snapshot data to be overwritten.

Link: http://tracker.ceph.com/issues/21483
Signed-off-by: "Yan, Zheng"
Signed-off-by: Ilya Dryomov

Yan, Zheng
2017-10-02 22:18:01 +0800
7e103ace9 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
"The scheduler pull request comes with the following updates:

- Prevent a divide by zero issue by validating the input value of
sysctl_sched_time_avg

- Make task state printing consistent all over the place and have
explicit state characters for IDLE and PARKED so they won't be
displayed as 'D' state which confuses tools"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/sysctl: Check user input value of sysctl_sched_time_avg
sched/debug: Add explicit TASK_PARKED printing
sched/debug: Ignore TASK_IDLE for SysRq-W
sched/debug: Add explicit TASK_IDLE printing
sched/tracing: Use common task-state helpers
sched/tracing: Fix trace_sched_switch task-state printing
sched/debug: Remove unused variable
sched/debug: Convert TASK_state to hex
sched/debug: Implement consistent task-state printing

Linus Torvalds
2017-10-02 03:10:02 +0800
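
The first item in this pull request, validating sysctl_sched_time_avg, follows a common pattern: reject a tunable value that would later be used as a divisor. The sketch below is a Python model under stated assumptions. The lower bound of 1, the handler name, and the consumer function are all hypothetical and are not taken from the kernel source.

```python
import errno
from typing import Tuple

SCHED_TIME_AVG_MIN = 1   # assumed lower bound for the tunable


def write_sched_time_avg(current: int, requested: int) -> Tuple[int, int]:
    """Return (status, value): reject out-of-range input and keep the old value."""
    if requested < SCHED_TIME_AVG_MIN:
        return -errno.EINVAL, current
    return 0, requested


def share_of_period(used: int, period: int) -> int:
    """Hypothetical consumer that divides by the tunable."""
    return used * 100 // period


rejected = write_sched_time_avg(1000, 0)     # zero would divide by zero later
accepted = write_sched_time_avg(1000, 500)
share = share_of_period(250, accepted[1])
```

Validating at write time keeps every later reader of the value free of zero checks.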

30 Sep, 2017

1 commit

5ba88cd6e Merge branch 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"We've collected a bunch of isolated fixes, for crashes, user-visible
behaviour or missing bits from other subsystem cleanups from the past.

The overall number is not small but I was not able to make it
significantly smaller. Most of the patches are supposed to go to
stable"

* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: log csums for all modified extents
Btrfs: fix unexpected result when dio reading corrupted blocks
btrfs: Report error on removing qgroup if del_qgroup_item fails
Btrfs: skip checksum when reading compressed data if some IO have failed
Btrfs: fix kernel oops while reading compressed data
Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
Btrfs: do not backup tree roots when fsync
btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
btrfs: propagate error to btrfs_cmp_data_prepare caller
btrfs: prevent to set invalid default subvolid
Btrfs: send: fix error number for unknown inode types
btrfs: fix NULL pointer dereference from free_reloc_roots()
btrfs: finish ordered extent cleaning if no progress is found
btrfs: clear ordered flag on cleaning up ordered extents
Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
Btrfs: do not reset bio->bi_ops while writing bio
Btrfs: use the new helper wbc_to_write_flags

Linus Torvalds
2017-09-30 03:57:35 +0800

29 Sep, 2017

4 commits

8ef9925b0 sched/debug: Add explicit TASK_PARKED printing

Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE. Give it
its own print state, because it will not in fact get woken by regular
wakeups and is a long-term state.

This requires moving TASK_PARKED into the TASK_REPORT mask, and since
the latter needs to be a contiguous bitmask, we need to shuffle the
bits around a bit.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2017-09-29 17:02:57 +0800
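
One way to see why the report mask in 8ef9925b0 has to be contiguous: if a state character is looked up by the position of the highest set bit, every bit position up to the top of the mask needs a character. The sketch below is a Python model; the mask values, the character table, and the function names are illustrative assumptions, not the kernel definitions.

```python
STATE_CHARS = "RSDTtXZP"          # illustrative table; index 0 means no bit set


def is_contiguous_from_bit0(mask: int) -> bool:
    """True when mask is a run of set bits starting at bit 0 (for example 0x7F)."""
    return mask != 0 and (mask & (mask + 1)) == 0


def state_char(state: int, report_mask: int) -> str:
    """Map a state to a character via the position of its highest reported bit."""
    return STATE_CHARS[(state & report_mask).bit_length()]


CONTIGUOUS_REPORT = 0x7F          # hypothetical mask covering bits 0..6
GAPPED_REPORT = 0x3F | 0x200      # hypothetical mask with a hole above bit 5

parked_char = state_char(0x40, CONTIGUOUS_REPORT)   # highest bit is bit 6
running_char = state_char(0x00, CONTIGUOUS_REPORT)  # no bits set
```

With a gapped mask the lookup table would need unused filler entries for the missing bit positions, which is what moving the bits around avoids.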
06eb61844 sched/debug: Add explicit TASK_IDLE printing

Markus reported that kthreads that idle using TASK_IDLE instead of
TASK_INTERRUPTIBLE are reported as TASK_UNINTERRUPTIBLE, and things
like htop mark those red.

This is undesirable, so add an explicit state for TASK_IDLE.

Reported-by: Markus Trippelsdorf
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2017-09-29 17:02:56 +0800
1593baab9 sched/debug: Implement consistent task-state printing

Currently get_task_state() and task_state_to_char() report different
states. Create a number of common helpers and unify the reported state
space.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2017-09-29 16:09:08 +0800
02a2b0539 Merge tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

- fix various problems with the copy-on-write extent maps getting freed
at the wrong time

- fix printk format specifier problems

- report zeroing operation outcomes instead of dropping them on the
floor

- fix some crashes when dio operations partially fail

- fix a race condition between unwritten extent conversion & dio read

- fix some incorrect tests in the inode log item processing

- correct the delayed allocation space reservations on rmap filesystems

- fix some problems checking for dax support

* tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: revert "xfs: factor rmap btree size into the indlen calculations"
xfs: Capture state of the right inode in xfs_iflush_done
xfs: perag initialization should only touch m_ag_max_usable for AG 0
xfs: update i_size after unwritten conversion in dio completion
iomap_dio_rw: Allocate AIO completion queue before submitting dio
xfs: validate bdev support for DAX inode flag
xfs: remove redundant re-initialization of total_nr_pages
xfs: Output warning message when discard option was enabled even though the device does not support discard
xfs: report zeroed or not correctly in xfs_zero_range()
xfs: kill meaningless variable 'zero'
fs/xfs: Use %pS printk format for direct addresses
xfs: evict CoW fork extents when performing finsert/fcollapse
xfs: don't unconditionally clear the reflink flag on zero-block files

Linus Torvalds
2017-09-29 04:27:23 +0800

28 Sep, 2017

1 commit

9cd6681cb Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull quota and isofs fixes from Jan Kara:
"Two quota fixes (fallout of the quota locking changes) and an isofs
build fix"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Fix quota corruption with generic/232 test
isofs: fix build regression
quota: add missing lock into __dquot_transfer()

Linus Torvalds
2017-09-28 03:22:12 +0800

27 Sep, 2017

8 commits

4c6bb6966 quota: Fix quota corruption with generic/232 test

Eric has reported that since commit d2faa415166b "quota: Do not acquire
dqio_sem for dquot overwrites in v2 format" test generic/232
occasionally fails due to quota information being incorrect. Indeed that
commit was too eager to remove dqio_sem completely from the path that
just overwrites quota structure with updated information. Although that
is innocent on its own, another process that inserts new quota structure
to the same block can perform read-modify-write cycle of that block thus
effectively discarding quota information update if they race in a wrong
way.

Fix the problem by acquiring dqio_sem for reading for overwrites of
quota structure. Note that it *is* possible to completely avoid taking
dqio_sem in the overwrite path; however, that would require modifying the
path that inserts / deletes quota structures to avoid RMW cycles of the
full block, and for now it is not clear whether it is worth the hassle.

Fixes: d2faa415166b2883428efa92f451774ef44373ac
Reported-and-tested-by: Eric Whitney
Signed-off-by: Jan Kara

Jan Kara
2017-09-27 17:33:47 +0800
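
The lost update described in 4c6bb6966 can be replayed deterministically. The sketch below is a Python model, not kernel code: a block is a two-slot list, the slot contents are made-up labels, and the assumption that the inserting path holds the lock exclusively during its read-modify-write is mine rather than something the commit message states.

```python
def racy_sequence() -> list:
    """Overwrite lands between the inserter's read and write: the update is lost."""
    block = ["A:old", None]       # slot 0: existing quota structure, slot 1: free
    snapshot = list(block)        # inserter reads the whole block
    block[0] = "A:new"            # overwriter updates slot 0 in place, no lock held
    snapshot[1] = "B:new"         # inserter modifies its private copy
    block[:] = snapshot           # inserter writes the whole block back
    return block


def serialized_sequence() -> list:
    """Overwrite holds the lock for reading, so it cannot interleave with the RMW."""
    block = ["A:old", None]
    block[0] = "A:new"            # overwrite completes before the inserter starts
    snapshot = list(block)        # inserter's read-modify-write runs as one unit
    snapshot[1] = "B:new"
    block[:] = snapshot
    return block


racy_result = racy_sequence()
serialized_result = serialized_sequence()
```

In the racy ordering slot 0 still holds the old value after both writers finish, which corresponds to the incorrect quota information seen in generic/232.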
fc46820b2 vfs: Return -ENXIO for negative SEEK_HOLE / SEEK_DATA offsets

In generic_file_llseek_size, return -ENXIO for negative offsets as well
as offsets beyond EOF. This affects filesystems which don't implement
SEEK_HOLE / SEEK_DATA internally, possibly because they don't support
holes.

Fixes xfstest generic/448.

Signed-off-by: Andreas Gruenbacher
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Andreas Gruenbacher
2017-09-27 04:46:06 +0800
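
The behaviour described in fc46820b2 can be modelled for a file without holes, where data spans [0, eof) and the only hole is the implicit one at eof. This is a Python sketch of the rule, not the kernel function; `seek_hole_data` is a hypothetical name, and the whence constants are the Linux numeric values.

```python
import errno

SEEK_DATA = 3   # Linux value
SEEK_HOLE = 4   # Linux value


def seek_hole_data(offset: int, whence: int, eof: int) -> int:
    """Return the resulting offset, or a negative errno, for a file with no holes."""
    if whence not in (SEEK_DATA, SEEK_HOLE):
        raise ValueError("only SEEK_DATA and SEEK_HOLE are modelled")
    # Negative offsets are rejected the same way as offsets at or beyond EOF.
    if offset < 0 or offset >= eof:
        return -errno.ENXIO
    if whence == SEEK_DATA:
        return offset      # the whole file is data, so the offset is already in data
    return eof             # the only hole is the implicit one at end of file


negative_data = seek_hole_data(-1, SEEK_DATA, 100)
negative_hole = seek_hole_data(-1, SEEK_HOLE, 100)
inside_hole = seek_hole_data(10, SEEK_HOLE, 100)
```

The change in the commit corresponds to the `offset < 0` half of the range check; the `offset >= eof` half describes the behaviour that already existed.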
5e5c943c1 xfs: revert "xfs: factor rmap btree size into the indlen calculations"

In commit fd26a88093ba we added a worst case estimate for rmapbt blocks
needed to satisfy the block mapping request. Since then, we added the
ability to reserve enough space in each AG such that we should never run
out of blocks to grow the rmapbt, which makes this calculation
unnecessary. Revert the commit because it makes the extra delalloc
indlen accounting unnecessary and incorrect.

Reported-by: Eryu Guan
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Darrick J. Wong

Darrick J. Wong
2017-09-27 01:55:20 +0800
842f6e9f7 xfs: Capture state of the right inode in xfs_iflush_done

My previous patch, d3a304b6292168b83b45d624784f973fdc1ca674, checks for the
XFS_LI_FAILED flag in xfs_iflush_done so that the failed item can be properly
resubmitted.

In the loop scanning other inodes being completed, it should check the
current item for the XFS_LI_FAILED, and not the initial one.

The state of the initial inode is checked after the loop ends.

Kudos to Eric for catching this.

Signed-off-by: Carlos Maiolino
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Carlos Maiolino
2017-09-27 01:55:20 +0800
9789dd9e1 xfs: perag initialization should only touch m_ag_max_usable for AG 0

We call __xfs_ag_resv_init to make a per-AG reservation for each AG.
This makes the reservation per-AG, not per-filesystem. Therefore, it
is incorrect to adjust m_ag_max_usable for each AG. Adjust it only
when we're reserving AG 0's blocks so that we only do it once per fs.

Signed-off-by: Darrick J. Wong
Reviewed-by: Brian Foster

Darrick J. Wong
2017-09-27 01:55:19 +0800
ee70daaba xfs: update i_size after unwritten conversion in dio completion

Since commit d531d91d6990 ("xfs: always use unwritten extents for
direct I/O writes"), we start allocating unwritten extents for all
direct writes to allow appending aio in XFS.

But for dio writes that could extend file size we update the in-core
inode size first, then convert the unwritten extents to real
allocations at dio completion time in xfs_dio_write_end_io(). Thus a
racing direct read could see the new i_size and find the unwritten
extents first and read zeros instead of actual data, if the direct
writer also takes a shared iolock.

Fix it by updating the in-core inode size after the unwritten extent
conversion. To do this, introduce a new boolean argument to
xfs_iomap_write_unwritten() to tell if we want to update in-core
i_size or not.

Suggested-by: Brian Foster
Reviewed-by: Brian Foster
Signed-off-by: Eryu Guan
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Eryu Guan
2017-09-27 01:55:19 +0800
546e7be82 iomap_dio_rw: Allocate AIO completion queue before submitting dio

Executing xfs/104 test in a loop on Linux-v4.13 kernel on a ppc64
machine can cause the following NULL pointer dereference,

.queue_work_on+0x4c/0x80
.iomap_dio_bio_end_io+0xbc/0x1f0
.bio_endio+0x118/0x1f0
.blk_update_request+0xd0/0x470
.blk_mq_end_request+0x24/0xc0
.lo_complete_rq+0x40/0xe0
.__blk_mq_complete_request_remote+0x28/0x40
.flush_smp_call_function_queue+0xc4/0x1e0
.smp_ipi_demux_relaxed+0x8c/0x100
.icp_hv_ipi_action+0x54/0xa0
.__handle_irq_event_percpu+0x84/0x2c0
.handle_irq_event_percpu+0x28/0x80
.handle_percpu_irq+0x78/0xc0
.generic_handle_irq+0x40/0x70
.__do_irq+0x88/0x200
.call_do_irq+0x14/0x24
.do_IRQ+0x84/0x130

This occurs due to the following sequence of events,

1. Allocate dio for Direct I/O write.
2. Invoke iomap_apply() until iov_iter_count() bytes have been submitted.
- Assume that we have submitted at least one bio. Hence iomap_dio->ref value
will be >= 2.
- If during the second iteration, iomap_apply() ends up returning -ENOSPC, we would
break out of the loop and since the 'ret' value is a negative number we
end up not allocating memory for super_block->s_dio_done_wq.
3. Meanwhile, iomap_dio_bio_end_io() is invoked for bios that have been
submitted and here the code ends up dereferencing the NULL pointer stored
at super_block->s_dio_done_wq.

This commit fixes the bug by allocating memory for
super_block->s_dio_done_wq before iomap_apply() is invoked.

Reported-by: Eryu Guan
Reviewed-by: Christoph Hellwig
Tested-by: Eryu Guan
Signed-off-by: Chandan Rajendra
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Chandan Rajendra
2017-09-27 01:55:19 +0800
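
The ordering problem in 546e7be82 can be sketched as a small Python model. The `SuperBlock` class and both functions are hypothetical stand-ins; only the field name s_dio_done_wq and the -ENOSPC scenario come from the commit message.

```python
import errno


class SuperBlock:
    """Hypothetical stand-in holding only the completion queue pointer."""

    def __init__(self) -> None:
        self.s_dio_done_wq = None


def dio_write(sb: SuperBlock, results: list, allocate_first: bool) -> int:
    """Submit one bio per entry; a negative entry models an error such as -ENOSPC.

    Returns the number of bios already in flight when submission stops.
    """
    if allocate_first:
        sb.s_dio_done_wq = []          # fixed ordering: allocate before submitting
    in_flight = 0
    ret = 0
    for result in results:
        if result < 0:
            ret = result
            break
        in_flight += 1
    if ret >= 0 and sb.s_dio_done_wq is None:
        sb.s_dio_done_wq = []          # old ordering: allocate only after a clean loop
    return in_flight


def completion_is_safe(sb: SuperBlock, in_flight: int) -> bool:
    """A completing bio needs the queue; report whether it exists."""
    return in_flight == 0 or sb.s_dio_done_wq is not None


old_sb, new_sb = SuperBlock(), SuperBlock()
old_safe = completion_is_safe(
    old_sb, dio_write(old_sb, [0, -errno.ENOSPC], allocate_first=False))
new_safe = completion_is_safe(
    new_sb, dio_write(new_sb, [0, -errno.ENOSPC], allocate_first=True))
```

In the old ordering one bio is in flight while the queue is still unset, which is the state that leads to the NULL pointer dereference in the trace above.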
6851a3db7 xfs: validate bdev support for DAX inode flag

Currently only the blocksize is checked, but we should really be calling
bdev_dax_supported() which also tests to make sure we can get a
struct dax_device and that the dax_direct_access() path is working.

This is the same check that we do for the "-o dax" mount option in
xfs_fs_fill_super().

This does not fix the race issues that caused the XFS DAX inode option to
be disabled, so that option will still be disabled. If/when we re-enable
it, though, I think we will want this issue to have been fixed. I also do
think that we want to fix this in stable kernels.

Signed-off-by: Ross Zwisler
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Ross Zwisler
2017-09-27 01:55:19 +0800