Eric Lee / smarc-fsl-linux-kernel

15 Jan, 2017

5 commits

ee99e2bc5 ipv6: handle -EFAULT from skb_copy_bits

[ Upstream commit a98f91758995cb59611e61318dddd8a6956b52c3 ]

By setting certain socket options on ipv6 raw sockets, we can confuse the
length calculation in rawv6_push_pending_frames triggering a BUG_ON.

RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
RSP: 0018:ffff881f6c4a7c18 EFLAGS: 00010282
RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002
RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00
RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009
R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030
R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80

Call Trace:
[] ? unmap_page_range+0x693/0x830
[] inet_sendmsg+0x67/0xa0
[] sock_sendmsg+0x38/0x50
[] SYSC_sendto+0xef/0x170
[] SyS_sendto+0xe/0x10
[] do_syscall_64+0x50/0xa0
[] entry_SYSCALL64_slow_path+0x25/0x25

Handle by jumping to the failure path if skb_copy_bits gets an EFAULT.
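
The change in control flow can be sketched in user space as follows. This is an illustration only, not the kernel patch: copy_bits() is a hypothetical stand-in for skb_copy_bits(), and push_pending() for the checksum step of rawv6_push_pending_frames(). The point is that a failed copy is reported to the caller instead of being treated as an impossible condition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for skb_copy_bits(): copy len bytes starting at
 * offset, or fail with -EFAULT when the range lies outside the buffer. */
static int copy_bits(const uint8_t *pkt, size_t pkt_len, size_t offset,
                     void *dst, size_t len)
{
    if (offset > pkt_len || len > pkt_len - offset)
        return -EFAULT;
    memcpy(dst, pkt + offset, len);
    return 0;
}

/* Hypothetical stand-in for the checksum step: read the 16-bit field at
 * csum_offset. A bad offset takes the failure path instead of crashing. */
static int push_pending(const uint8_t *pkt, size_t pkt_len,
                        size_t csum_offset, uint16_t *csum)
{
    int err;

    err = copy_bits(pkt, pkt_len, csum_offset, csum, sizeof(*csum));
    if (err < 0)
        goto out_drop;

    return 0;

out_drop:
    /* Real code would also discard the queued frames at this point. */
    return err;
}
```

With an 8-byte buffer, an offset of 2 succeeds, while an offset of 7 or 100 returns -EFAULT to the caller.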

Reproducer:

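    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define LEN 504

    int main(int argc, char* argv[])
    {
        int fd;
        int zero = 0;
        char buf[LEN];

        memset(buf, 0, LEN);

        /* Raw IPv6 socket; needs CAP_NET_RAW. */
        fd = socket(AF_INET6, SOCK_RAW, 7);

        /* Checksum offset 0 plus a large destination-options header. */
        setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, 4);
        setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN);

        /* A one-byte payload is enough to hit the bad length calculation. */
        sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110);
    }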

Signed-off-by: Dave Jones
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Dave Jones
2017-01-15 20:42:53 +0800
d36a1cb1e inet: fix IP(V6)_RECVORIGDSTADDR for udp sockets

[ Upstream commit 39b2dd765e0711e1efd1d1df089473a8dd93ad48 ]

Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
the packet. For sockets that have transport headers pulled, transport
offset can be negative. Use signed comparison to avoid overflow.
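
A minimal user-space sketch of why the comparison has to be signed. The function names are hypothetical, and the constant 4 is an assumption standing for the two 16-bit port fields; the kernel code itself is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Unsigned comparison: a negative offset wraps to a huge value, so the
 * ports appear to lie outside the packet and the cmsg is skipped. */
static bool ports_outside_unsigned(int transport_offset, unsigned int len)
{
    return (unsigned int)(transport_offset + 4) > len;
}

/* Signed comparison: a negative offset stays negative, so a packet whose
 * transport header was already pulled still passes the range check. */
static bool ports_outside_signed(int transport_offset, unsigned int len)
{
    return transport_offset + 4 > (int)len;
}
```

For an offset of -8 and a length of 100, the unsigned variant reports the ports as out of range while the signed variant does not; for non-negative offsets both agree.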

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Reported-by: Nisar Jagabar
Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Willem de Bruijn
2017-01-15 20:42:52 +0800
ed3cc329c sctp: sctp_transport_lookup_process should rcu_read_unlock when transport is null

[ Upstream commit 08abb79542c9e8c367d1d8e44fe1026868d3f0a7 ]

Prior to this patch, sctp_transport_lookup_process didn't rcu_read_unlock
when it failed to find a transport by sctp_addrs_lookup_transport.

This patch is to fix it by moving up rcu_read_unlock right before checking
transport and also to remove the out path.
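
The resulting shape can be sketched with user-space stand-ins. Everything below is hypothetical: a counter models the RCU read-side nesting, and lookup_process() only mirrors the ordering described above. A real implementation would also need to pin the transport before dropping the read lock, which is left out here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

static int rcu_depth;   /* models read-side critical section nesting */

static void fake_rcu_read_lock(void)   { rcu_depth++; }
static void fake_rcu_read_unlock(void) { rcu_depth--; }

struct transport { int id; };

/* The unlock happens before the NULL check, so both the "not found" and
 * the "found" paths leave the critical section exactly once. */
static int lookup_process(struct transport *found,
                          int (*cb)(struct transport *))
{
    struct transport *t;

    fake_rcu_read_lock();
    t = found;                /* stand-in for the address lookup */
    fake_rcu_read_unlock();

    if (!t)
        return -ENOENT;
    return cb(t);
}

/* Trivial callback used to exercise the found path. */
static int return_id(struct transport *t)
{
    return t->id;
}
```

After either call path, rcu_depth is back at zero, which is the balance the original code lost when the lookup failed.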

Fixes: 1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock")
Signed-off-by: Xin Long
Acked-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Xin Long
2017-01-15 20:42:52 +0800
8b8fbe5c2 net: vrf: Drop conntrack data after pass through VRF device on Tx

[ Upstream commit eb63ecc1706b3e094d0f57438b6c2067cfc299f2 ]

Locally originated traffic in a VRF fails in the presence of a POSTROUTING
rule. For example,

$ iptables -t nat -A POSTROUTING -s 11.1.1.0/24 -j MASQUERADE
$ ping -I red -c1 11.1.1.3
ping: Warning: source address might be selected on device other than red.
PING 11.1.1.3 (11.1.1.3) from 11.1.1.2 red: 56(84) bytes of data.
ping: sendmsg: Operation not permitted

Worse, the above causes random corruption resulting in a panic in random
places (I have not seen a consistent backtrace).

Call nf_reset to drop the conntrack info following the pass through the
VRF device. The nf_reset is needed on Tx but not Rx because of the order
in which NF_HOOK's are hit: on Rx the VRF device is after the real ingress
device and on Tx it is before the real egress device. Connection
tracking should be tied to the real egress device and not the VRF device.

Fixes: 8f58336d3f78a ("net: Add ethernet header for pass through VRF device")
Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-01-15 20:42:52 +0800
d4a0b2e40 net: vrf: Fix NAT within a VRF

[ Upstream commit a0f37efa82253994b99623dbf41eea8dd0ba169b ]

Connection tracking with VRF is broken because the pass through the VRF
device drops the connection tracking info. Removing the call to nf_reset
allows DNAT and MASQUERADE to work across interfaces within a VRF.

Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2017-01-15 20:42:52 +0800

12 Jan, 2017

35 commits

584fd7872 Linux 4.9.3

Greg Kroah-Hartman
2017-01-12 18:41:42 +0800
3999c535d usb: gadget: composite: always set ep->mult to a sensible value

commit eaa496ffaaf19591fe471a36cef366146eeb9153 upstream.

ep->mult is supposed to be set to Isochronous and
Interrupt Endpoint's multiplier value. This value
is computed from different places depending on the
link speed.

If we're dealing with HighSpeed, then it's part of
bits [12:11] of wMaxPacketSize. This case wasn't
taken into consideration before.

While at that, also make sure the ep->mult defaults
to one so drivers can use it unconditionally and
assume they'll never multiply ep->maxpacket to zero.
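
As an illustration of the HighSpeed case, the two fields can be split out of wMaxPacketSize as below. The helper names are hypothetical, and the "+ 1" mapping (the bits count additional transactions per microframe) is an assumption based on the USB 2.0 endpoint descriptor layout; the commit text itself only gives the bit positions and the default of one.

```c
#include <assert.h>
#include <stdint.h>

/* Bits [10:0]: packet size in bytes. */
static unsigned int maxp_size(uint16_t w_max_packet_size)
{
    return w_max_packet_size & 0x7ff;
}

/* Bits [12:11]: additional transactions per microframe. Adding one keeps
 * the multiplier at one or more, so size * mult is never zero. */
static unsigned int hs_mult(uint16_t w_max_packet_size)
{
    return ((w_max_packet_size >> 11) & 0x3) + 1;
}
```

For example, a descriptor value of 0x1400 decodes to a 1024-byte packet with a multiplier of 3, and 0x0400 to the same packet size with the default multiplier of 1.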

Signed-off-by: Felipe Balbi
Signed-off-by: Greg Kroah-Hartman

Felipe Balbi
2017-01-12 18:39:46 +0800
7ff469ceb Revert "usb: gadget: composite: always set ep->mult to a sensible value"

This reverts commit eab1c4e2d0ad4509ccb8476a604074547dc202e0 which is
commit eaa496ffaaf19591fe471a36cef366146eeb9153 upstream as it was
incorrectly backported.

Reported-by: Bin Liu
Cc: Felipe Balbi
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2017-01-12 18:39:46 +0800
ec3d5c521 Revert "rtlwifi: Fix enter/exit power_save"

This reverts commit 98068574928f499b30f136ff57ef9a03cc575a36, which is
commit ba9f93f82abafe2552eac942ebb11c2df4f8dd7f upstream as it causes
problems.

Reported-by: Dmitry Osipenko
Cc: Ping-Ke Shih
Cc: Larry Finger
Cc: Kalle Valo
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

Greg Kroah-Hartman
2017-01-12 18:39:45 +0800
cf365b117 tick/broadcast: Prevent NULL pointer dereference

commit c1a9eeb938b5433947e5ea22f89baff3182e7075 upstream.

When a dysfunctional timer, e.g. dummy timer, is installed, the tick core
tries to set up the broadcast timer.

If no broadcast device is installed, the kernel crashes with a NULL pointer
dereference in tick_broadcast_setup_oneshot() because the function has no
sanity check.

Reported-by: Mason
Signed-off-by: Thomas Gleixner
Cc: Mark Rutland
Cc: Anna-Maria Gleixner
Cc: Richard Cochran
Cc: Sebastian Andrzej Siewior
Cc: Daniel Lezcano
Cc: Peter Zijlstra
Cc: Sebastian Frias
Cc: Thibaud Cornic
Cc: Robin Murphy
Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2017-01-12 18:39:45 +0800
34db201f0 clocksource/dummy_timer: Move hotplug callback after the real timers

commit 9bf11ecce5a2758e5a097c2f3a13d08552d0d6f9 upstream.

When the dummy timer callback is invoked before the real timer callbacks,
then it tries to install that timer for the starting CPU. If the platform
does not have a broadcast timer installed the installation fails with a
kernel crash. The crash happens due to an unconditional dereference of the
non-available broadcast device. This needs to be fixed in the timer core code.

But even when this is fixed in the core code then installing the dummy
timer before the real timers is a pointless exercise.

Move it to the end of the callback list.

Fixes: 00c1d17aab51 ("clocksource/dummy_timer: Convert to hotplug state machine")
Reported-and-tested-by: Mason
Signed-off-by: Thomas Gleixner
Cc: Mark Rutland
Cc: Anna-Maria Gleixner
Cc: Richard Cochran
Cc: Sebastian Andrzej Siewior
Cc: Daniel Lezcano
Cc: Peter Zijlstra
Cc: Sebastian Frias
Cc: Thibaud Cornic
Cc: Robin Murphy
Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2017-01-12 18:39:45 +0800
1b9c25568 xfs: fix max_retries _show and _store functions

commit ff97f2399edac1e0fb3fa7851d5fbcbdf04717cf upstream.

max_retries _show and _store functions should test against cfg->max_retries,
not cfg->retry_timeout
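
A hedged sketch of the corrected shape, using made-up names. The struct, the RETRY_FOREVER sentinel and both functions are hypothetical stand-ins, not the XFS sysfs code; what matters is that both handlers consult max_retries and never retry_timeout.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define RETRY_FOREVER INT_MAX   /* hypothetical "retry forever" sentinel */

struct error_cfg {
    int  max_retries;
    long retry_timeout;
};

/* Show handler: report -1 for "forever", based on max_retries only. */
static int max_retries_show(const struct error_cfg *cfg, char *buf, size_t len)
{
    int retries = (cfg->max_retries == RETRY_FOREVER) ? -1 : cfg->max_retries;

    return snprintf(buf, len, "%d\n", retries);
}

/* Store handler: accept -1 for "forever", reject other out-of-range values. */
static int max_retries_store(struct error_cfg *cfg, const char *buf)
{
    long val = strtol(buf, NULL, 10);

    if (val < -1 || val >= RETRY_FOREVER)
        return -EINVAL;
    cfg->max_retries = (val == -1) ? RETRY_FOREVER : (int)val;
    return 0;
}
```

Testing the wrong field would, in this sketch, print -1 for a configuration whose timeout happens to hold the sentinel even though its retry count is finite.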

Signed-off-by: Carlos Maiolino
Reviewed-by: Eric Sandeen
Signed-off-by: Darrick J. Wong
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Carlos Maiolino
2017-01-12 18:39:45 +0800
91192ae41 xfs: fix crash and data corruption due to removal of busy COW extents

commit a1b7a4dea6166cf46be895bce4aac67ea5160fe8 upstream.

There is a race window between write_cache_pages calling
clear_page_dirty_for_io and XFS calling set_page_writeback, in which
the mapping for an inode is tagged neither as dirty, nor as writeback.

If the COW shrinker hits in exactly that window we'll remove the delayed
COW extents and writepages trying to write it back, which in release
kernels will manifest as corruption of the bmap btree, and in debug
kernels will trip the ASSERT about now calling xfs_bmapi_write with the
COWFORK flag for holes. A complex customer load manages to hit this
window fairly reliably, probably by always having COW writeback in flight
while the cow shrinker runs.

This patch adds another check for having the I_DIRTY_PAGES flag set,
which is still set during this race window. While this fixes the problem
I'm still not overly happy about the way the COW shrinker works as it
still seems a bit fragile.

Signed-off-by: Christoph Hellwig
Signed-off-by: Darrick J. Wong
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Christoph Hellwig
2017-01-12 18:39:45 +0800
b96e4e87d xfs: use the actual AG length when reserving blocks

commit 20e73b000bcded44a91b79429d8fa743247602ad upstream.

We need to use the actual AG length when making per-AG reservations,
since we could otherwise end up reserving more blocks out of the last
AG than there are actual blocks.
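
The arithmetic behind this can be sketched as follows. The function and parameter names are hypothetical (agblocks is the nominal AG size, dblocks the total number of filesystem blocks), and agcount is assumed to be at least 1.

```c
#include <assert.h>
#include <stdint.h>

/* Every AG except the last has the nominal size; the last one only holds
 * whatever is left of the filesystem and may be shorter. */
static uint64_t ag_length(uint32_t agno, uint32_t agcount,
                          uint64_t agblocks, uint64_t dblocks)
{
    if (agno < agcount - 1)
        return agblocks;
    return dblocks - (uint64_t)(agcount - 1) * agblocks;
}
```

With 1000 blocks split into 4 AGs of nominally 300 blocks, the last AG has only 100 blocks, so a reservation sized from the nominal 300 could exceed what is actually there.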

Complained-about-by: Brian Foster
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:44 +0800
d9c7c9fa6 xfs: fix double-cleanup when CUI recovery fails

commit 7a21272b088894070391a94fdd1c67014020fa1d upstream.

Dan Carpenter reported a double-free of rcur if _defer_finish fails
while we're recovering CUI items. Fix the error recovery to prevent
this.

Reported-by: Dan Carpenter
Signed-off-by: Darrick J. Wong
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:44 +0800
aa38f370b xfs: use GPF_NOFS when allocating btree cursors

commit b24a978c377be5f14e798cb41238e66fe51aab2f upstream.

Use NOFS for allocating btree cursors, since they can be called
under the ilock.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:44 +0800
3c382dda4 xfs: ignore leaf attr ichdr.count in verifier during log replay

commit 2e1d23370e75d7d89350d41b4ab58c7f6a0e26b2 upstream.

When we create a new attribute, we first create a shortform
attribute, and try to fit the new attribute into it.
If that fails, we copy the (empty) attribute into a leaf attribute,
and do the copy again. Thus there can be a transient state where
we have an empty leaf attribute.

If we encounter this during log replay, the verifier will fail.
So add a test to ignore this part of the leaf attr verification
during log replay.

Thanks as usual to dchinner for spotting the problem.

Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2017-01-12 18:39:44 +0800
c00203386 xfs: don't cap maximum dedupe request length

commit 1bb33a98702d8360947f18a44349df75ba555d5d upstream.

After various discussions on linux-fsdevel, it has been decided that it
is not necessary to cap the length of a dedupe request, and that
correctly-written userspace client programs will be able to absorb the
change. Therefore, remove the length clamping behavior.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:44 +0800
f8b20705a xfs: don't allow di_size with high bit set

commit ef388e2054feedaeb05399ed654bdb06f385d294 upstream.

The on-disk field di_size is used to set i_size, which is a signed
integer of loff_t. If the high bit of di_size is set, we'll end up with
a negative i_size, which will cause all sorts of problems. Since the
VFS won't let us create a file with such length, we should catch them
here in the verifier too.
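
A minimal sketch of such a check, assuming a 64-bit on-disk size field. The function name is hypothetical and this is not the XFS verifier itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reject any on-disk size whose top bit is set: interpreted as a signed
 * 64-bit file size, it would be negative. */
static bool di_size_ok(uint64_t di_size)
{
    return (di_size & (UINT64_C(1) << 63)) == 0;
}
```

Any value up to INT64_MAX passes; anything from 2^63 upward is rejected before it can be used as a file size.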

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
12815dd15 xfs: error out if trying to add attrs and anextents > 0

commit 0f352f8ee8412bd9d34fb2a6411241da61175c0e upstream.

We shouldn't assert if somehow we end up trying to add an attr fork to
an inode that apparently already has attr extents because this is an
indication of on-disk corruption. Instead, return an error code to
userspace.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
cd4bf1d41 xfs: don't crash if reading a directory results in an unexpected hole

commit 96a3aefb8ffde23180130460b0b2407b328eb727 upstream.

In xfs_dir3_data_read, we can encounter the situation where err == 0 and
*bpp == NULL if the given bno offset happens to be a hole; this leads to
a crash if we try to set the buffer type after the _da_read_buf call.
Holes can happen due to corrupt or malicious entries in the bmbt data,
so be a little more careful when we're handling buffers.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
b88398de1 xfs: complain if we don't get nextents bmap records

commit 356a3225222e5bc4df88aef3419fb6424f18ab69 upstream.

When reading into memory all extents of a btree-format inode fork,
complain if the number of extents we find is not the same as the number
of extents reported in the inode core. This is needed to stop an IO
action from accessing the garbage areas of the in-core fork.

[dchinner: removed redundant assert]

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
4bb31bcce xfs: check for bogus values in btree block headers

commit bb3be7e7c1c18e1b141d4cadeb98cc89ecf78099 upstream.

When we're reading a btree block, make sure that what we retrieved
matches the owner and level; and has a plausible number of records.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
b85f32481 xfs: forbid AG btrees with level == 0

commit d2a047f31e86941fa896e0e3271536d50aba415e upstream.

There is no such thing as a zero-level AG btree since even a single-node
zero-records btree has one level. Btree cursor constructors read
cur_nlevels straight from disk and then access things like
cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero!
Therefore, strengthen the verifiers to prevent this possibility.
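
The hazard and the guard can be sketched in a few lines. MAX_LEVELS and both function names are hypothetical; the sketch only shows why a level count read from disk must be validated before it is used as an array index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LEVELS 8   /* hypothetical cursor depth */

/* Verifier-style check: a btree always has at least one level. */
static bool ag_btree_levels_ok(uint32_t nlevels)
{
    return nlevels >= 1 && nlevels <= MAX_LEVELS;
}

/* Index of the root slot in a cursor's buffer array. With nlevels == 0
 * this would wrap around to UINT32_MAX, hence the check above. */
static uint32_t root_slot(uint32_t nlevels)
{
    return nlevels - 1;
}
```

A caller would run ag_btree_levels_ok() on the on-disk value first and only then compute root_slot().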

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:42 +0800
4081d4a79 xfs: handle cow fork in xfs_bmap_trace_exlist

commit c44a1f22626c153976289e1cd67bdcdfefc16e1f upstream.

By inspection, xfs_bmap_trace_exlist isn't handling cow forks,
and will trace the data fork instead.

Fix this by setting state appropriately if whichfork
== XFS_COW_FORK.

()___()
< @ @ >
| |
{o_o}
(|)

Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2017-01-12 18:39:42 +0800
a585e1c4e xfs: pass state not whichfork to trace_xfs_extlist

commit 7710517fc37b1899722707883b54694ea710b3c0 upstream.

When xfs_bmap_trace_exlist called trace_xfs_extlist,
it sent in the "whichfork" var instead of the bmap "state"
as expected (even though state was already set up for this
purpose).

As a result, the xfs_bmap_class in tracing code used
"whichfork" not state in xfs_iext_state_to_fork(), and got
the wrong ifork pointer. It all goes downhill from
there, including an ASSERT when ifp_bytes is empty
by the time it reaches xfs_iext_get_ext():

XFS: Assertion failed: idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t)

Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2017-01-12 18:39:42 +0800
bdbfd4ee6 xfs: Move AGI buffer type setting to xfs_read_agi

commit 200237d6746faaeaf7f4ff4abbf13f3917cee60a upstream.

We've missed properly setting the buffer type for
an AGI transaction in 3 spots now, so just move it
into xfs_read_agi() and set it if we are in a transaction
to avoid the problem in the future.

This is similar to how it is done in, e.g., the dir3
and attr3 read functions.

Signed-off-by: Eric Sandeen
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2017-01-12 18:39:42 +0800
06ac11df9 xfs: pass post-eof speculative prealloc blocks to bmapi

commit f782088c9e5d08e9494c63e68b4e85716df3e5f8 upstream.

xfs_file_iomap_begin_delay() implements post-eof speculative
preallocation by extending the block count of the requested delayed
allocation. Now that xfs_bmapi_reserve_delalloc() has been updated to
handle prealloc blocks separately and tag the inode, update
xfs_file_iomap_begin_delay() to use the new parameter and rely on the
former to tag the inode.

Note that this patch does not change behavior.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Brian Foster
2017-01-12 18:39:42 +0800
553937d3c xfs: use new extent lookup helpers xfs_file_iomap_begin_delay

commit 656152e552e5cbe0c11ad261b524376217c2fb13 upstream.

And only look up the previous extent inside xfs_iomap_prealloc_size
if we actually need it.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Christoph Hellwig
2017-01-12 18:39:41 +0800
3d6e3b12b xfs: clean up cow fork reservation and tag inodes correctly

commit 0260d8ff5f76617e3a55a1c471383ecb4404c3ad upstream.

COW fork reservation is implemented via delayed allocation. The code is
modeled after the traditional delalloc allocation code, but is slightly
different in terms of how preallocation occurs. Rather than post-eof
speculative preallocation, COW fork preallocation is implemented via a
COW extent size hint that is designed to minimize fragmentation as a
reflinked file is split over time.

xfs_reflink_reserve_cow() still uses logic that is oriented towards
dealing with post-eof speculative preallocation, however, and is stale
or not necessarily correct. First, the EOF alignment to the COW extent
size hint is implemented in xfs_bmapi_reserve_delalloc() (which does so
correctly by aligning the start and end offsets) and so is not necessary
in xfs_reflink_reserve_cow(). The backoff and retry logic on ENOSPC is
also ineffective for the same reason, as xfs_bmapi_reserve_delalloc()
will simply perform the same allocation request on the retry. Finally,
since the COW extent size hint aligns the start and end offset of the
range to allocate, the end_fsb != orig_end_fsb logic is not sufficient.
Indeed, if a write request happens to end on an aligned offset, it is
possible that we do not tag the inode for COW preallocation even though
xfs_bmapi_reserve_delalloc() may have preallocated at the start offset.

Kill the unnecessary, duplicate code in xfs_reflink_reserve_cow().
Remove the inode tag logic as well since xfs_bmapi_reserve_delalloc()
has been updated to tag the inode correctly.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Brian Foster
2017-01-12 18:39:41 +0800
4a323331d xfs: use new extent lookup helpers in __xfs_reflink_reserve_cow

commit 2755fc4438501c8c28e7783df890e889f6772bee upstream.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Christoph Hellwig
2017-01-12 18:39:41 +0800
cf168f2ff xfs: track preallocation separately in xfs_bmapi_reserve_delalloc()

commit 974ae922efd93b07b6cdf989ae959883f6f05fd8 upstream.

Speculative preallocation is currently processed entirely by the callers
of xfs_bmapi_reserve_delalloc(). The caller determines how much
preallocation to include, adjusts the extent length and passes down the
resulting request.

While this works fine for post-eof speculative preallocation, it is not
as reliable for COW fork preallocation. COW fork preallocation is
implemented via the cowextszhint, which aligns the start offset as well
as the length of the extent. Further, it is difficult for the caller to
accurately identify when preallocation occurs because the returned
extent could have been merged with neighboring extents in the fork.

To simplify this situation and facilitate further COW fork preallocation
enhancements, update xfs_bmapi_reserve_delalloc() to take a separate
preallocation parameter to incorporate into the allocation request. The
preallocation blocks value is tacked onto the end of the request and
adjusted to accommodate neighboring extents and extent size limits.
Since xfs_bmapi_reserve_delalloc() now knows precisely how much
preallocation was included in the allocation, it can also tag the inodes
appropriately to support preallocation reclaim.

Note that xfs_bmapi_reserve_delalloc() callers are not yet updated to
use the preallocation mechanism. This patch should not change behavior
outside of correctly tagging reflink inodes when start offset
preallocation occurs (which the caller does not handle correctly).

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Brian Foster
2017-01-12 18:39:41 +0800
cf4fb5104 xfs: remove prev argument to xfs_bmapi_reserve_delalloc

commit 65c5f419788d623a0410eca1866134f5e4628594 upstream.

We can easily look up the previous extent for the cases where we need it,
which saves the callers from looking it up for us later in the series.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Christoph Hellwig
2017-01-12 18:39:41 +0800
390325766 xfs: always succeed when deduping zero bytes

commit fba3e594ef0ad911fa8f559732d588172f212d71 upstream.

It turns out that btrfs and xfs had differing interpretations of what
to do when the dedupe length is zero. Change xfs to follow btrfs'
semantics so that the userland interface is consistent.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:40 +0800
2b7dae91a xfs: factor rmap btree size into the indlen calculations

commit fd26a88093bab6529ea2de819114ca92dbd1d71d upstream.

When we're estimating the amount of space it's going to take to satisfy
a delalloc reservation, we need to include the space that we might need
to grow the rmapbt. This helps us to avoid running out of space later
when _iomap_write_allocate needs more space than we reserved. Eryu Guan
observed this happening on generic/224 when sunit/swidth were set.

Reported-by: Eryu Guan
Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:40 +0800
49dc19915 xfs: new inode extent list lookup helpers

commit 93533c7855c3c78c8a900cac65c8d669bb14935d upstream.

xfs_iext_lookup_extent looks up a single extent at the passed in offset,
and returns the extent covering the area, or the one behind it in case
of a hole, as well as the index of the returned extent in arguments,
as well as a simple bool as return value that is set to false if no
extent could be found because the offset is behind EOF. It is a simpler
replacement for xfs_bmap_search_extent that leaves looking up the rarely
needed previous extent to the caller and has a nicer calling convention.

xfs_iext_get_extent is a helper for iterating over the extent list,
it takes an extent index as input, and returns the extent at that index
in its expanded form in an argument if it exists. The actual return
value is a bool whether the index is valid or not.
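
The calling convention described above can be modelled over a plain sorted array. All types and names below are hypothetical simplifications (the real helpers operate on XFS inode forks and use the on-disk record format); the sketch only mirrors the bool return value and the out-arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct extent { uint64_t startoff; uint64_t blockcount; };

/* Extents are sorted by startoff and do not overlap. */
struct fork { const struct extent *recs; size_t count; };

/* Return the extent at idx through *got; false if idx is out of range. */
static bool iext_get_extent(const struct fork *f, size_t idx,
                            struct extent *got)
{
    if (idx >= f->count)
        return false;
    *got = f->recs[idx];
    return true;
}

/* Find the extent covering off, or the next one after it if off falls in
 * a hole. Returns false when off lies beyond the last extent. */
static bool iext_lookup_extent(const struct fork *f, uint64_t off,
                               size_t *idxp, struct extent *got)
{
    for (size_t i = 0; i < f->count; i++) {
        const struct extent *e = &f->recs[i];

        if (off < e->startoff + e->blockcount) {
            *idxp = i;
            *got = *e;
            return true;
        }
    }
    return false;
}
```

A caller that needs the previous extent can fetch it itself with iext_get_extent(f, idx - 1, &prev) when idx is nonzero, which is the division of labour the commit message describes.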

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Christoph Hellwig
2017-01-12 18:39:40 +0800
b49ef758f xfs: fix unbalanced inode reclaim flush locking

commit 98efe8af1c9ffac47e842b7a75ded903e2f028da upstream.

Filesystem shutdown testing on an older distro kernel has uncovered an
imbalanced locking pattern for the inode flush lock in
xfs_reclaim_inode(). Specifically, there is a double unlock sequence
between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
"reclaim:" label.

This actually does not cause obvious problems on current kernels due to
the current flush lock implementation. Older kernels use a counting
based flush lock mechanism, however, which effectively breaks the lock
indefinitely when an already unlocked flush lock is repeatedly unlocked.
Though this only currently occurs on filesystem shutdown, it has
reproduced the effect of elevating an fs shutdown to a system-wide crash
or hang.

As it turns out, the flush lock is not actually required for the reclaim
logic in xfs_reclaim_inode() because by that time we have already cycled
the flush lock once while holding ILOCK_EXCL. Therefore, remove the
additional flush lock/unlock cycle around the 'reclaim:' label and
update branches into this label to release the flush lock where
appropriate. Add an assert to xfs_ifunlock() to help prevent future
occurrences of the same problem.

Reported-by: Zorro Lang
Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Brian Foster
2017-01-12 18:39:40 +0800
63fa793e7 xfs: check minimum block size for CRC filesystems

commit bec9d48d7a303a5bb95c05961ff07ec7eeb59058 upstream.

Check the minimum block size on v5 filesystems.

[dchinner: cleaned up XFS_MIN_CRC_BLOCKSIZE check]

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:40 +0800
f380ee72a xfs: provide helper for counting extents from if_bytes

commit 5d829300bee000980a09ac2ccb761cb25867b67c upstream.

The open-coded pattern:

ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)

is all over the xfs code; provide a new helper
xfs_iext_count(ifp) to count the number of inline extents
in an inode fork.
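
A sketch of what such a helper looks like, with simplified stand-in types. struct bmbt_rec and struct ifork are hypothetical reductions of the real XFS structures; only the division is taken from the pattern quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent record: two 64-bit words. */
struct bmbt_rec { uint64_t l0, l1; };

/* Hypothetical stand-in for an inode fork; only the byte count matters. */
struct ifork { size_t if_bytes; };

/* One place for the division that was previously open-coded everywhere. */
static inline unsigned int iext_count(const struct ifork *ifp)
{
    return (unsigned int)(ifp->if_bytes / sizeof(struct bmbt_rec));
}
```

Callers then write iext_count(ifp) instead of repeating the division and the cast at each site.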

[dchinner: pick up several missed conversions]

Signed-off-by: Eric Sandeen
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2017-01-12 18:39:39 +0800
3978c5bb0 xfs: don't BUG() on mixed direct and mapped I/O

commit 04197b341f23b908193308b8d63d17ff23232598 upstream.

We've had reports of generic/095 causing XFS to BUG() in
__xfs_get_blocks() due to the existence of delalloc blocks on a
direct I/O read. generic/095 issues a mix of various types of I/O,
including direct and memory mapped I/O to a single file. This is
clearly not supported behavior and is known to lead to such
problems. E.g., the lack of exclusion between the direct I/O and
write fault paths means that a write fault can allocate delalloc
blocks in a region of a file that was previously a hole after the
direct read has attempted to flush/inval the file range, but before
it actually reads the block mapping. In turn, the direct read
discovers a delalloc extent and cannot proceed.

While the appropriate solution here is to not mix direct and memory
mapped I/O to the same regions of the same file, the current
BUG_ON() behavior is probably overkill as it can crash the entire
system. Instead, localize the failure to the I/O in question by
returning an error for a direct I/O that cannot be handled safely
due to delalloc blocks. Be careful to allow the case of a direct
write to post-eof delalloc blocks. This can occur due to speculative
preallocation and is safe as post-eof blocks are not accompanied by
dirty pages in pagecache (conversely, preallocation within eof must
have been zeroed, and thus dirtied, before the inode size could have
been increased beyond said blocks).

Finally, provide an additional warning if a direct I/O write occurs
while the file is memory mapped. This may not catch all problematic
scenarios, but provides a hint that some known-to-be-problematic I/O
methods are in use.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Brian Foster
2017-01-12 18:39:39 +0800