30 Sep, 2015

25 commits

  • commit 841df7df196237ea63233f0f9eaa41db53afd70f upstream.

    Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
    superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
    when the journal is aborted. That makes logic in
    jbd2_log_do_checkpoint() bail out, which is fine, except that
    jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
    progress in cleaning the journal. Without that progress,
    jbd2_journal_destroy() just loops forever.

    Fix jbd2_journal_destroy() to clean up the journal checkpoint lists
    itself when jbd2_log_do_checkpoint() fails with an error.
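
    The shape of the fix is roughly as follows (a simplified sketch, not
    the upstream diff; discard_checkpoint_lists() is a hypothetical helper
    standing in for the checkpoint-list cleanup the real patch performs):

    /* jbd2_journal_destroy(): checkpoint loop.  If checkpointing cannot
     * make progress (e.g. the journal is aborted), drop the checkpoint
     * lists by hand so the loop terminates instead of spinning forever. */
    spin_lock(&journal->j_list_lock);
    while (journal->j_checkpoint_transactions != NULL) {
            int err;

            spin_unlock(&journal->j_list_lock);
            mutex_lock(&journal->j_checkpoint_mutex);
            err = jbd2_log_do_checkpoint(journal);
            mutex_unlock(&journal->j_checkpoint_mutex);
            if (err)
                    discard_checkpoint_lists(journal);  /* hypothetical helper */
            spin_lock(&journal->j_list_lock);
    }
    spin_unlock(&journal->j_list_lock);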

    Reported-by: Eryu Guan
    Tested-by: Eryu Guan
    Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 7cb74be6fd827e314f81df3c5889b87e4c87c569 upstream.

    Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
    hfs_bnode_find() for finding or creating pages corresponding to an inode)
    are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
    and should not be page_cache_release()'ed until hfs_bnode_free().

    This patch fixes a problem I first saw in July 2012: merely running "du"
    on a large hfsplus-mounted directory a few times on a reasonably loaded
    system would get the hfsplus driver all confused and complaining about
    B-tree inconsistencies and generate a "BUG: Bad page state". Most
    recently, I can generate this problem on up-to-date Fedora 22 with shipped
    kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
    mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
    lightly-used QEMU VM image of the full Mac OS X 10.9:

    $ df -i / /home /mnt
    Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
    /dev/mapper/fedora-root    3276800  551665    2725135   17% /
    /dev/mapper/fedora-home   52879360  716221   52163139    2% /home
    /dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt

    After applying the patch, I was able to run "du /" (60+ times) and "du
    /mnt" (150+ times) continuously and simultaneously for 6+ hours.

    There are many reports of the hfsplus driver getting confused under load
    and generating "BUG: Bad page state" or other similar issues over the
    years. [1]

    The unpatched code [2] has always been wrong since it entered the kernel
    tree. The only reason why it gets away with it is that the
    kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
    the kernel has not had a chance to reuse the memory for something else,
    most of the time.

    The current RW driver appears to have followed the design and development
    of the earlier read-only hfsplus driver [3], whereby version 0.1 (Dec
    2001) had a B-tree node-centric approach to
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
    migrating towards version 0.2 (June 2002) of caching and releasing pages
    per inode extents. When the current RW code first entered the kernel [2]
    in 2005, there was an REF_PAGES conditional (and "//" commented out code)
    to switch between B-node-centric paging and inode-centric paging. There
    was a mistake with the direction of one of the REF_PAGES conditionals in
    __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
    removed, but a page_cache_release() was mistakenly left in (propagating
    the "REF_PAGES !REF_PAGE" mistake), and the commented-out
    page_cache_release() in bnode_release() (which should be spanned by
    !REF_PAGES) was never enabled.
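
    In code terms the corrected page lifetime looks roughly like this (a
    simplified sketch of the idea, not the exact driver source; error
    handling trimmed):

    /* __hfs_bnode_create(): keep the reference taken by
     * read_mapping_page() for the life of the bnode... */
    for (i = 0; i < tree->pages_per_bnode; i++) {
            page = read_mapping_page(mapping, block++, NULL);
            if (IS_ERR(page))
                    goto fail;
            /* no page_cache_release(page) here any more */
            node->page[i] = page;
    }

    /* ...and only drop it in hfs_bnode_free(): */
    void hfs_bnode_free(struct hfs_bnode *node)
    {
            int i;

            for (i = 0; i < node->tree->pages_per_bnode; i++)
                    if (node->page[i])
                            page_cache_release(node->page[i]);
            kfree(node);
    }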

    References:
    [1]:
    Michael Fox, Apr 2013
    http://www.spinics.net/lists/linux-fsdevel/msg63807.html
    ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

    Sasha Levin, Feb 2015
    http://lkml.org/lkml/2015/2/20/85 ("use after free")

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
    https://bugzilla.kernel.org/show_bug.cgi?id=42342
    https://bugzilla.kernel.org/show_bug.cgi?id=63841
    https://bugzilla.kernel.org/show_bug.cgi?id=78761

    [2]:
    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
    commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

    commit 91556682e0bf004d98a529bf829d339abb98bbbd
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

    [3]:
    http://sourceforge.net/projects/linux-hfsplus/

    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

    http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
    fs/hfsplus/bnode.c?r1=1.4&r2=1.5

    Date: Thu Jun 6 09:45:14 2002 +0000
    Use buffer cache instead of page cache in bnode.c. Cache inode extents.

    [4]:
    http://git.kernel.org/cgit/linux/kernel/git/\
    stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

    commit a5e3985fa014029eb6795664c704953720cc7f7d
    Author: Roman Zippel
    Date: Tue Sep 6 15:18:47 2005 -0700

    [PATCH] hfs: remove debug code

    Signed-off-by: Hin-Tak Leung
    Signed-off-by: Sergei Antonov
    Reviewed-by: Anton Altaparmakov
    Reported-by: Sasha Levin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Vyacheslav Dubeyko
    Cc: Sougata Santra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hin-Tak Leung
     
  • commit b4cc0efea4f0bfa2477c56af406cfcf3d3e58680 upstream.

    Fix B-tree corruption when a new record is inserted at position 0 in the
    node in hfs_brec_insert().

    This is the same change, applied to the corresponding hfs B-tree code,
    as Sergei Antonov's "hfsplus: fix B-tree corruption after insertion at
    position 0", to keep similar code paths in the hfs and hfsplus drivers
    in sync where appropriate.

    Signed-off-by: Hin-Tak Leung
    Cc: Sergei Antonov
    Cc: Joe Perches
    Reviewed-by: Vyacheslav Dubeyko
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hin-Tak Leung
     
  • commit 5556e7e6d30e8e9b5ee51b0e5edd526ee80e5e36 upstream.

    Consider eCryptfs dcache entries to be stale when the corresponding
    lower inode's i_nlink count is zero. This solves a problem caused by the
    lower inode being directly modified, without going through the eCryptfs
    mount, leaving stale eCryptfs dentries cached and the eCryptfs inode's
    i_nlink count not being cleared.
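
    A minimal sketch of the revalidation check (simplified; the real
    ecryptfs_d_revalidate() does more than this):

    static int ecryptfs_d_revalidate(struct dentry *dentry, unsigned int flags)
    {
            int rc = 1;

            /* existing lower-dentry revalidation elided */
            if (d_really_is_positive(dentry)) {
                    struct inode *lower = ecryptfs_inode_to_lower(d_inode(dentry));

                    /* lower file was unlinked behind our back: stale */
                    if (lower->i_nlink == 0)
                            rc = 0;
            }
            return rc;
    }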

    Signed-off-by: Tyler Hicks
    Reported-by: Richard Weinberger
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit 40f705a736eac10e7dca7ab5dd5ed675a6df031d upstream.

    On a filesystem like vfat, all files are created with the same owner
    and mode, independent of who created the file. When a vfat filesystem
    is mounted with root as the owner of all files and read access for
    everyone, root's processes leave world-readable coredumps on it (other
    users' processes only leave empty corefiles when given write access,
    because of the uid mismatch).

    Given that the old behavior was inconsistent and insecure, I don't see
    a problem with changing it. Now, all processes refuse to dump core unless
    the resulting corefile will only be readable by their owner.
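
    The check amounts to something like the following (a hedged sketch of
    the idea; the exact mask and helpers used upstream may differ):

    /* After opening the corefile, verify the filesystem actually gave us
     * an owner-only file; on vfat the requested 0600 mode is ignored. */
    struct inode *inode = file_inode(cprm.file);

    if (!uid_eq(inode->i_uid, current_fsuid()))
            goto close_fail;                  /* owned by someone else */
    if (inode->i_mode & (S_IRWXG | S_IRWXO))
            goto close_fail;                  /* group/other can access it */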

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit fbb1816942c04429e85dbf4c1a080accc534299e upstream.

    It was possible for an attacking user to trick root (or another user) into
    writing his coredumps into an attacker-readable, pre-existing file using
    rename() or link(), causing the disclosure of secret data from the victim
    process' virtual memory. Depending on the configuration, it was also
    possible to trick root into overwriting system files with coredumps. Fix
    that issue by never writing coredumps into existing files.

    Requirements for the attack:
    - The attack only applies if the victim's process has a nonzero
      RLIMIT_CORE and is dumpable.
    - The attacker can trick the victim into coredumping into an
      attacker-writable directory D, either because the core_pattern is
      relative and the victim's cwd is attacker-writable or because an
      absolute core_pattern pointing to a world-writable directory is used.
    - The attacker has one of these:
      A: on a system with protected_hardlinks=0:
         execute access to a folder containing a victim-owned,
         attacker-readable file on the same partition as D, and the
         victim-owned file will be deleted before the main part of the
         attack takes place. (In practice, there are lots of files that
         fulfill this condition, e.g. entries in Debian's
         /var/lib/dpkg/info/.) This does not apply to most Linux systems
         because most distros set protected_hardlinks=1.
      B: on a system with protected_hardlinks=1:
         execute access to a folder containing a victim-owned,
         attacker-readable and attacker-writable file on the same partition
         as D, and the victim-owned file will be deleted before the main
         part of the attack takes place. (This seems to be uncommon.)
      C: on any system, independent of protected_hardlinks:
         write access to a non-sticky folder containing a victim-owned,
         attacker-readable file on the same partition as D.
         (This seems to be uncommon.)

    The basic idea is that the attacker moves the victim-owned file to where
    he expects the victim process to dump its core. The victim process dumps
    its core into the existing file, and the attacker reads the coredump from
    it.

    If the attacker can't move the file because he does not have write access
    to the containing directory, he can instead link the file to a directory
    he controls, then wait for the original link to the file to be deleted
    (because the kernel checks that the link count of the corefile is 1).

    A less reliable variant that requires D to be non-sticky works with link()
    and does not require deletion of the original link: link() the file into
    D, but then unlink() it directly before the kernel performs the link count
    check.

    On systems with protected_hardlinks=0, this variant allows an attacker to
    not only gain information from coredumps, but also clobber existing,
    victim-writable files with coredumps. (This could theoretically lead to a
    privilege escalation.)
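
    The core of the fix is to refuse pre-existing files when opening the
    corefile, along these lines (simplified sketch; identifiers
    approximate those in do_coredump()):

    /* O_CREAT | O_EXCL makes the open fail if the path already exists, so
     * a file planted via rename()/link() can never receive the dump. */
    int open_flags = O_CREAT | O_EXCL | O_RDWR | O_NOFOLLOW | O_LARGEFILE;

    cprm.file = filp_open(cn.corename, open_flags, 0600);
    if (IS_ERR(cprm.file))
            goto fail_unlock;   /* refuse to dump rather than overwrite */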

    Signed-off-by: Jann Horn
    Cc: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit 36319608e28701c07cad80ae3be8b0fdfb1ab40f upstream.

    This reverts commit 4e379d36c050b0117b5d10048be63a44f5036115.

    This commit opens up a race between the recovery code and the open code.

    Reported-by: Olga Kornievskaia
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 4a1e2feb9d246775dee0f78ed5b18826bae2b1c5 upstream.

    According to RFC5661 Section 18.2.4, CLOSE is supposed to return
    the zero stateid. This means that nfs_clear_open_stateid_locked()
    cannot assume that the result stateid will always match the 'other'
    field of the existing open stateid when trying to determine a race
    with a parallel OPEN.

    Instead, we look at the argument, and check for matches.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit d13549074cf066d6d5bb29903d044beffea342d3 upstream.

    According to the flexfiles protocol, the layoutreturn should specify an
    array of errors in the following format:

    struct ff_ioerr4 {
            offset4        ffie_offset;
            length4        ffie_length;
            stateid4       ffie_stateid;
            device_error4  ffie_errors<>;
    };

    This patch fixes up the code to ensure that our ffie_errors is indeed
    encoded as an array (albeit with only a single entry).
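
    On the wire that means emitting an array length before the single
    error entry, along these lines (illustrative sketch;
    encode_device_error() is a stand-in name, not the driver's function):

    __be32 *p;

    p = xdr_reserve_space(xdr, 4);
    *p = cpu_to_be32(1);             /* ffie_errors<> array length: 1 */
    encode_device_error(xdr, err);   /* the single device_error4 entry */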

    Reported-by: Tom Haynes
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 5420401079e152ff68a8024f6a375804b1c21505 upstream.

    We do not want to update inode attributes with DS values.

    Signed-off-by: Peng Tao
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit aaae3f00d3f67f681a1f3cb7af999e976e8a24ce upstream.

    If the ctime, mtime or change attribute has changed because
    of an operation we initiated, we should make sure that we force
    an attribute update. However we do not want to mark the page cache
    for revalidation.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 69f230d907e8c1ca3f9bd528993eeb98f712b0dd upstream.

    Otherwise we break fstest case tests/read_write/mctime.t

    Does files layout need the same fix as well?

    Cc: Anna Schumaker
    Signed-off-by: Peng Tao
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit e9ae58aeee8842a50f7e199d602a5ccb2e41a95f upstream.

    We should ensure that we always set the pgio_header's error field
    if a READ or WRITE RPC call returns an error. The current code depends
    on 'hdr->good_bytes' always being initialised to a large value, which
    is not always done correctly by callers.
    When this happens, applications may end up missing important errors.
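
    The idea is to record the failure on the header directly instead of
    relying on good_bytes, roughly (a sketch, not the exact layout-driver
    diff):

    if (task->tk_status < 0)
            /* mark the request as failed from this offset onwards */
            nfs_set_pgio_error(hdr, task->tk_status, hdr->args.offset);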

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 18e3b739fdc826481c6a1335ce0c5b19b3d415da upstream.

    --- Steps to Reproduce ---

    # cat /etc/exports
    /nfs/referal *(rw,insecure,no_subtree_check,no_root_squash,crossmnt)
    /nfs/old *(ro,insecure,subtree_check,root_squash,crossmnt)

    # mount -t nfs nfs-server:/nfs/ /mnt/
    # ll /mnt/*/

    # cat /etc/exports
    /nfs/referal *(rw,insecure,no_subtree_check,no_root_squash,crossmnt,refer=/nfs/old/@nfs-server)
    /nfs/old *(ro,insecure,subtree_check,root_squash,crossmnt)
    # service nfs restart

    # ll /mnt/*/ --->>>>> oops here

    [ 5123.102925] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 5123.103363] IP: [] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
    [ 5123.103752] PGD 587b9067 PUD 3cbf5067 PMD 0
    [ 5123.104131] Oops: 0000 [#1]
    [ 5123.104529] Modules linked in: nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev vmw_balloon parport_pc parport i2c_piix4 shpchp auth_rpcgss nfs_acl vmw_vmci lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi serio_raw scsi_transport_spi e1000 mptscsih mptbase ata_generic pata_acpi [last unloaded: nfsd]
    [ 5123.105887] CPU: 0 PID: 15853 Comm: ::1-manager Tainted: G OE 4.2.0-rc6+ #214
    [ 5123.106358] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
    [ 5123.106860] task: ffff88007620f300 ti: ffff88005877c000 task.ti: ffff88005877c000
    [ 5123.107363] RIP: 0010:[] [] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
    [ 5123.107909] RSP: 0018:ffff88005877fdb8 EFLAGS: 00010246
    [ 5123.108435] RAX: ffff880053f3bc00 RBX: ffff88006ce6c908 RCX: ffff880053a0d240
    [ 5123.108968] RDX: ffffea0000e6d940 RSI: ffff8800399a0000 RDI: ffff88006ce6c908
    [ 5123.109503] RBP: ffff88005877fe28 R08: ffffffff81c708a0 R09: 0000000000000000
    [ 5123.110045] R10: 00000000000001a2 R11: ffff88003ba7f5c8 R12: ffff880054c55800
    [ 5123.110618] R13: 0000000000000000 R14: ffff880053a0d240 R15: ffff880053a0d240
    [ 5123.111169] FS: 0000000000000000(0000) GS:ffffffff81c27000(0000) knlGS:0000000000000000
    [ 5123.111726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 5123.112286] CR2: 0000000000000000 CR3: 0000000054cac000 CR4: 00000000001406f0
    [ 5123.112888] Stack:
    [ 5123.113458] ffffea0000e6d940 ffff8800399a0000 00000000000167d0 0000000000000000
    [ 5123.114049] 0000000000000000 0000000000000000 0000000000000000 00000000a7ec82c6
    [ 5123.114662] ffff88005877fe18 ffffea0000e6d940 ffff8800399a0000 ffff880054c55800
    [ 5123.115264] Call Trace:
    [ 5123.115868] [] nfs4_try_migration+0xbb/0x220 [nfsv4]
    [ 5123.116487] [] nfs4_run_state_manager+0x4ab/0x7b0 [nfsv4]
    [ 5123.117104] [] ? nfs4_do_reclaim+0x510/0x510 [nfsv4]
    [ 5123.117813] [] kthread+0xd7/0xf0
    [ 5123.118456] [] ? kthread_worker_fn+0x160/0x160
    [ 5123.119108] [] ret_from_fork+0x3f/0x70
    [ 5123.119723] [] ? kthread_worker_fn+0x160/0x160
    [ 5123.120329] Code: 4c 8b 6a 58 74 17 eb 52 48 8d 55 a8 89 c6 4c 89 e7 e8 4a b5 ff ff 8b 45 b0 85 c0 74 1c 4c 89 f9 48 8b 55 90 48 8b 75 98 48 89 df ff 55 00 3d e8 d8 ff ff 41 89 c6 74 cf 48 8b 4d c8 65 48 33
    [ 5123.121643] RIP [] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
    [ 5123.122308] RSP
    [ 5123.122942] CR2: 0000000000000000

    Fixes: ec011fe847 ("NFS: Introduce a vector of migration recovery ops")
    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
     
  • commit 6f536936b79bd4b5cea8fb0e5b8b0bce8cd1ea4a upstream.

    - Switch back to using list_for_each_entry(). Fixes an incorrect test
    for list NULL termination.
    - Do not assume that lists are sorted.
    - Finally, consider an existing entry to match if it consists of a subset
    of the addresses in the new entry.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 7c2dad99d60c86ec686b3bfdcb787c450a7ea89f upstream.

    Chuck reports seeing cases where a GETATTR that happens to race
    with an asynchronous WRITE is overriding the file size, despite
    the attribute barrier being set by the writeback code.

    The culprit turns out to be the check in nfs_ctime_need_update(),
    which sees that the ctime is newer than the cached ctime, and
    assumes that it is safe to override the attribute barrier.
    This patch removes that override, and ensures that attribute
    barriers are always respected.

    Reported-by: Chuck Lever
    Fixes: a08a8cd375db9 ("NFS: Add attribute update barriers to NFS writebacks")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit efcbc04e16dfa95fef76309f89710dd1d99a5453 upstream.

    It is unusual to combine the open flags O_RDONLY and O_EXCL, but
    it appears that LibreOffice does just that.

    [pid 3250] stat("/home/USER/.config", {st_mode=S_IFDIR|0700, st_size=8192, ...}) = 0
    [pid 3250] open("/home/USER/.config/libreoffice/4-suse/user/extensions/buildid", O_RDONLY|O_EXCL

    NFSv4 takes O_EXCL as a sign that a setattr command should be sent,
    probably to reset the timestamps.

    When it was an O_RDONLY open, the SETATTR command does not
    identify any actual attributes to change.
    If no delegation was provided to the open, the SETATTR uses the
    all-zeros stateid and the request is accepted (at least by the
    Linux NFS server - no harm, no foul).

    If a read-delegation was provided, this is used in the SETATTR
    request, and a Netapp filer will justifiably claim
    NFS4ERR_BAD_STATEID, which the Linux client takes as a sign
    to retry - indefinitely.

    So only treat O_EXCL specially if O_CREAT was also given.
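
    The guard boils down to a combined flag test, roughly (sketch only;
    the helper name is approximate):

    /* Only an exclusive *create* should trigger the timestamp-resetting
     * SETATTR; a plain O_RDONLY|O_EXCL open must not. */
    if ((flags & O_CREAT) && (flags & O_EXCL))
            nfs4_exclusive_attrset(opendata, sattr);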

    Signed-off-by: NeilBrown
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 3fcbbd244ed1d20dc0eb7d48d729503992fa9b7d upstream.

    It's possible that a DELEGRETURN could race with (e.g.) client expiry,
    in which case we could end up putting the delegation hash reference more
    than once.

    Have unhash_delegation_locked return a bool that indicates whether it
    was already unhashed. In the case of destroy_delegation we only
    conditionally put the hash reference if that returns true.

    The other callers of unhash_delegation_locked call it while walking
    list_heads that shouldn't yet be detached. If we find that it doesn't
    return true in those cases, then throw a WARN_ON as that indicates that
    we have a partially hashed delegation, and that something is likely very
    wrong.

    Tested-by: Andrew W Elble
    Tested-by: Anna Schumaker
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit e85687393f3ee0a77ccca016f903d1558bb69258 upstream.

    When an open or lock stateid is hashed, we take an extra reference to
    it. When we unhash it, we drop that reference. The code however does
    not properly account for the case where we have two callers concurrently
    trying to unhash the stateid. This can lead to list corruption and the
    hash reference being put more than once.

    Fix this by having unhash_ol_stateid use list_del_init on the st_perfile
    list_head, and then testing to see if that list_head is empty before
    releasing the hash reference. This means that some of the unhashing
    wrappers now become bool return functions so we can test to see whether
    the stateid was unhashed before we put the reference.
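
    The pattern is to make unhashing idempotent and let only the caller
    that actually removed the entry drop the hash reference, roughly (a
    simplified sketch of the idea; locking omitted):

    static bool unhash_ol_stateid(struct nfs4_ol_stateid *stp)
    {
            bool unhashed = !list_empty(&stp->st_perfile);

            /* list_del_init() is safe to repeat on an already-empty head */
            list_del_init(&stp->st_perfile);
            list_del_init(&stp->st_perstateowner);
            return unhashed;
    }

    /* caller: */
    if (unhash_ol_stateid(stp))
            nfs4_put_stid(&stp->st_stid);   /* hash reference put exactly once */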

    Reported-by: Andrew W Elble
    Tested-by: Andrew W Elble
    Reported-by: Anna Schumaker
    Tested-by: Anna Schumaker
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 6896f15aabde505b35888039af93d1d182a0108a upstream.

    Currently we'll respond correctly to a request for either
    FS_LAYOUT_TYPES or LAYOUT_TYPES, but not to a request for both
    attributes simultaneously.

    Signed-off-by: Kinglong Mee
    Reviewed-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
     
  • commit 2b83d3de4c18af49800e0b26ae013db4fcf43a4a upstream.

    pNFS writes don't return attributes, however that doesn't mean that we
    should ignore the fact that they may be extending the file. This patch
    ensures that if a write is seen to extend the file, then we always set
    an attribute barrier, and update the cached file size.

    Signed-off-by: Trond Myklebust
    Cc: Peng Tao
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 1f9b8c8fbc9a4d029760b16f477b9d15500e3a34 upstream.

    While we are committing a transaction, it's possible the previous one is
    still finishing its commit and therefore we wait for it to finish first.
    However we were not checking if that previous transaction ended up getting
    aborted after we waited for it to commit, so we ended up committing the
    current transaction which can lead to fs corruption because the new
    superblock can point to trees that have had one or more nodes/leaves that
    were never durably persisted.
    The following sequence diagram exemplifies how this is possible:

    CPU 0                                     CPU 1

    transaction N starts

    (...)

    btrfs_commit_transaction(N)

      cur_trans->state = TRANS_STATE_COMMIT_START;
      (...)
      cur_trans->state = TRANS_STATE_COMMIT_DOING;
      (...)

      cur_trans->state = TRANS_STATE_UNBLOCKED;
      root->fs_info->running_transaction = NULL;

                                              btrfs_start_transaction()
                                                --> starts transaction N + 1

      btrfs_write_and_wait_transaction(trans, root);
        --> starts writing all new or COWed ebs created
            at transaction N

                                              creates some new ebs, COWs some
                                              existing ebs but doesn't COW or
                                              deletes eb X

                                              btrfs_commit_transaction(N + 1)
                                                (...)
                                                cur_trans->state = TRANS_STATE_COMMIT_START;
                                                (...)
                                                wait_for_commit(root, prev_trans);
                                                  --> prev_trans == transaction N

      btrfs_write_and_wait_transaction() continues
      writing ebs
        --> fails writing eb X, we abort transaction N
            and set bit BTRFS_FS_STATE_ERROR on
            fs_info->fs_state, so no new transactions
            can start after setting that bit

      cleanup_transaction()
        btrfs_cleanup_one_transaction()
          wakes up task at CPU 1

                                              continues, doesn't abort because
                                              cur_trans->aborted (transaction N + 1)
                                              is zero, and no checks for bit
                                              BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                              are made

                                              btrfs_write_and_wait_transaction(trans, root);
                                                --> succeeds, no errors during writeback

                                              write_ctree_super(trans, root, 0);
                                                --> succeeds
                                                --> we have now a superblock that points us
                                                    to some root that uses eb X, which was
                                                    never written to disk

    In this scenario future attempts to read eb X from disk results in an
    error message like "parent transid verify failed on X wanted Y found Z".

    So fix this by aborting the current transaction if after waiting for the
    previous transaction we verify that it was aborted.
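
    In code, the fix is essentially one extra check after the wait (a
    simplified excerpt-style sketch of btrfs_commit_transaction()):

    if (prev_trans->state != TRANS_STATE_COMPLETED) {
            atomic_inc(&prev_trans->use_count);
            spin_unlock(&root->fs_info->trans_lock);

            wait_for_commit(root, prev_trans);
            /* new: did the transaction we waited for get aborted? */
            ret = prev_trans->aborted;

            btrfs_put_transaction(prev_trans);
            if (ret)
                    goto cleanup_transaction;
    }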

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 4c17a6d56bb0cad3066a714e94f7185a24b40f49 upstream.

    This might lead to local privilege escalation (code execution as
    kernel) for systems where the following conditions are met:

    - CONFIG_CIFS_SMB2 and CONFIG_CIFS_POSIX are enabled
    - a cifs filesystem is mounted where:
      - the mount option "vers" was used and set to a value >= 2.0
      - the attacker has write access to at least one file on the filesystem

    To attack this, an attacker would have to guess the target_tcon
    pointer (but guessing wrong doesn't cause a crash, it just returns an
    error code) and win a narrow race.

    Signed-off-by: Jann Horn
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit bdfe0cbd746aa9b2509c2f6d6be17193cf7facd7 upstream.

    This reverts commit 08439fec266c3cc5702953b4f54bdf5649357de0.

    Unfortunately we still need to test for bdi->dev to avoid a crash when a
    USB stick is yanked out while a file system is mounted:

    usb 2-2: USB disconnect, device number 2
    Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write
    JBD2: Error -5 detected when updating journal superblock for sdb1-8.
    BUG: unable to handle kernel paging request at 34beb000
    IP: [] __percpu_counter_add+0x18/0xc0
    *pdpt = 0000000023db9001 *pde = 0000000000000000
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 4083 Comm: umount Tainted: G U OE 4.1.1-040101-generic #201507011435
    Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
    task: ebf06b50 ti: ebebc000 task.ti: ebebc000
    EIP: 0060:[] EFLAGS: 00010082 CPU: 0
    EIP is at __percpu_counter_add+0x18/0xc0
    EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001
    ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0
    Stack:
    c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160
    ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160
    f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000
    Call Trace:
    [] account_page_dirtied+0x74/0x110
    [] __set_page_dirty+0x3f/0xb0
    [] mark_buffer_dirty+0x53/0xc0
    [] ext4_commit_super+0x17b/0x250
    [] ext4_put_super+0xc1/0x320
    [] ? fsnotify_unmount_inodes+0x1aa/0x1c0
    [] ? evict_inodes+0xca/0xe0
    [] generic_shutdown_super+0x6a/0xe0
    [] ? prepare_to_wait_event+0xd0/0xd0
    [] ? unregister_shrinker+0x40/0x50
    [] kill_block_super+0x26/0x70
    [] deactivate_locked_super+0x45/0x80
    [] deactivate_super+0x47/0x60
    [] cleanup_mnt+0x39/0x80
    [] __cleanup_mnt+0x10/0x20
    [] task_work_run+0x91/0xd0
    [] do_notify_resume+0x7c/0x90
    [] work_notify
    Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89
    EIP: [] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40
    CR2: 0000000034beb000
    ---[ end trace dd564a7bea834ecd ]---

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011

    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit c642dc9e1aaed953597e7092d7df329e6234096e upstream.

    At some point along this sequence of changes:

    f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
    bb04457 ext4: support freezing ext2 (nojournal) file systems
    9ca9238 ext4: Use separate super_operations structure for no_journal filesystems

    ext4 started setting needs_recovery on filesystems without journals
    when they are unfrozen. This makes no sense, and in fact confuses
    blkid to the point where it doesn't recognize the filesystem at all.

    (freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
    see needs_recovery set on fs w/ no journal).

    To fix this, don't manipulate the INCOMPAT_RECOVER feature on
    filesystems without journals.
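
    The change boils down to guarding the feature-flag manipulation on the
    presence of a journal, roughly like this (a sketch, assuming the
    change lands in ext4_unfreeze()):

    static int ext4_unfreeze(struct super_block *sb)
    {
            if (sb->s_flags & MS_RDONLY)
                    return 0;

            if (EXT4_SB(sb)->s_journal) {
                    /* Reset the needs_recovery flag before the fs is unlocked. */
                    EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
            }

            ext4_commit_super(sb, 1);
            return 0;
    }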

    Reported-by: Stu Mark
    Reviewed-by: Jan Kara
    Signed-off-by: Eric Sandeen
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     

22 Sep, 2015

6 commits

  • commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream.

    Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.
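
    Usage in a ->show_options handler looks roughly like this (filesystem
    and field names are illustrative; the helper escapes the option value
    before it reaches /proc/mounts):

    static int example_show_options(struct seq_file *m, struct dentry *root)
    {
            struct example_fs_info *fsi = root->d_sb->s_fs_info;  /* illustrative */

            /* values containing spaces/newlines are escaped */
            seq_show_option(m, "lowerdir", fsi->lowerdir);
            seq_show_option(m, "upperdir", fsi->upperdir);
            return 0;
    }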

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit f49a26e7718dd30b49e3541e3e25aecf5e7294e2 upstream.

    Update ctime and mtime when a directory is modified. (though OS/2 doesn't
    update them anyway)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 4b75de8615050c1b0dd8d7794838c42f74ed36ba upstream.

    Before the make_empty_dir_inode calls were introduced into proc, sysfs,
    and sysctl, those directories reported an i_size of 0 when stat()ed.
    make_empty_dir_inode started reporting an i_size of 2. At least one
    userspace application depended on stat returning an i_size of 0, so
    modify make_empty_dir_inode to cause an i_size of 0 to be reported for
    these directories.

    Reported-by: Tejun Heo
    Acked-by: Tejun Heo
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 037542345a82aaaa228ec280fe6ddff1568d169f upstream.

    Users have occasionally reported that the file type of some directory
    entries is wrong. This mostly happened after updating some libraries.
    After some debugging the problem was traced down to
    xfs_dir2_node_replace(). The function uses args->filetype as the file
    type to store in the replaced directory entry, however it also calls
    xfs_da3_node_lookup_int(), which stores the file type of the current
    directory entry in args->filetype. Thus we fail to change the file type
    of the directory entry to the proper type.

    Fix the problem by storing new file type in a local variable before
    calling xfs_da3_node_lookup_int().
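
    The fix pattern is simply to latch the requested type before the
    lookup can clobber it (schematic sketch; local names illustrative):

    int ftype = args->filetype;   /* the type we were asked to store */

    /* the lookup overwrites args->filetype with the current on-disk type */
    error = xfs_da3_node_lookup_int(state, &rval);
    if (error)
            goto done;

    /* ... locate the data entry to replace ... */
    /* when rewriting the entry, use the saved 'ftype', not args->filetype */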

    Reported-by: Giacomo Comes
    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit ffeecc5213024ae663377b442eedcfbacf6d0c5d upstream.

    struct xfs_attr_leafblock contains an 'entries' array which is declared
    with size 1 although it can in fact contain many more entries. Since
    this array is followed by further struct members, gcc (at least in
    version 4.8.3) thinks that the array has a fixed size of 1 element and
    thus may optimize away all accesses beyond the end of the array,
    resulting in non-working code. This problem was only observed with
    userspace code in xfsprogs, however it's better to be safe in the
    kernel as well and have matching kernel and xfsprogs definitions.
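
    A generic illustration of the hazard (plain C, not the actual xfs
    definitions):

    struct entry   { unsigned int hashval; unsigned short nameidx; };
    struct trailer { unsigned int pad; };

    struct leaf_bad {
            struct entry   entries[1];  /* really variable length... */
            struct trailer t;           /* ...but a member follows, so the
                                         * compiler may treat entries[i > 0]
                                         * as undefined and optimize those
                                         * accesses away */
    };

    /* Safer: keep nothing after the variable-sized region and compute the
     * entry pointer from the start of the on-disk block instead. */
    static inline struct entry *leaf_entries(void *block, size_t hdr_size)
    {
            return (struct entry *)((char *)block + hdr_size);
    }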

    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 2f123bce18943fff819bc10f8868ffb9149fc622 upstream.

    In the dir3 data block readahead function, use the regular read
    verifier to check the block's CRC and spot-check the block contents
    instead of directly calling only the spot-checking routine. This
    prevents corrupted directory data blocks from being read into the
    kernel, which can lead to garbage ls output and directory loops (if
    say one of the entries contains slashes and other junk).

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     

17 Aug, 2015

7 commits

  • commit 8fcd461db7c09337b6d2e22d25eb411123f379e3 upstream.

    Currently, preprocess_stateid_op calls nfs4_check_olstateid which
    verifies that the open stateid corresponds to the current filehandle in the
    call by calling nfs4_check_fh.

    If the stateid is a NFS4_DELEG_STID however, then no such check is done.
    This could cause incorrect enforcement of permissions, because the
    nfsd_permission() call in nfs4_check_file uses the current
    filehandle, but any subsequent IO operation will use the file descriptor
    in the stateid.

    Move the call to nfs4_check_fh into nfs4_check_file instead so that it
    can be done for all stateid types.

    Signed-off-by: Jeff Layton
    [bfields: moved fh check to avoid NULL deref in special stateid case]
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit a0649b2d3fffb1cde8745568c767f3a55a3462bc upstream.

    Split out two self contained helpers to make the function more readable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Cc: Jeff Layton
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 3ead7c52bdb0ab44f4bb1feed505a8323cc12ba7 upstream.

    This function may copy the si_addr_lsb field to user mode when it hasn't
    been initialized, which can leak kernel stack data to user mode.

    Just checking the value of si_code is insufficient because the same
    si_code value is shared between multiple signals. This is solved by
    checking the value of si_signo in addition to si_code.
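
    The check ends up looking roughly like this (sketch based on the
    description above; variable names are illustrative, and the
    memory-failure si_code values shown are the ones that define
    si_addr_lsb):

    /*
     * Other signals can share these si_code values, so key off si_signo
     * too before copying the (possibly uninitialised) si_addr_lsb field.
     */
    if (from->si_signo == SIGBUS &&
        (from->si_code == BUS_MCEERR_AR || from->si_code == BUS_MCEERR_AO))
            err |= __put_user(from->si_addr_lsb, &to->si_addr_lsb);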

    Signed-off-by: Amanieu d'Antras
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Amanieu d'Antras
     
  • commit c2227a39a078473115910512aa0f8d53bd915e60 upstream.

    On an absent filesystem (one served by another server), we need to be
    able to handle requests for certain attributes (like fs_locations, so
    the client can find out which server does have the filesystem), but
    others we can't.

    We forgot to take that into account when adding another attribute
    bitmask word for the SECURITY_LABEL attribute.

    Then an export entry with the "refer" option can result in:

    [ 88.414272] kernel BUG at fs/nfsd/nfs4xdr.c:2249!
    [ 88.414828] invalid opcode: 0000 [#1] SMP
    [ 88.415368] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nfsd xfs libcrc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iosf_mbi ppdev btrfs coretemp crct10dif_pclmul crc32_pclmul crc32c_intel xor ghash_clmulni_intel raid6_pq vmw_balloon parport_pc parport i2c_piix4 shpchp vmw_vmci acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi mptscsih serio_raw mptbase e1000 scsi_transport_spi ata_generic pata_acpi [last unloaded: nfsd]
    [ 88.417827] CPU: 0 PID: 2116 Comm: nfsd Not tainted 4.0.7-300.fc22.x86_64 #1
    [ 88.418448] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
    [ 88.419093] task: ffff880079146d50 ti: ffff8800785d8000 task.ti: ffff8800785d8000
    [ 88.419729] RIP: 0010:[] [] nfsd4_encode_fattr+0x820/0x1f00 [nfsd]
    [ 88.420376] RSP: 0000:ffff8800785db998 EFLAGS: 00010206
    [ 88.421027] RAX: 0000000000000001 RBX: 000000000018091a RCX: ffff88006668b980
    [ 88.421676] RDX: 00000000fffef7fc RSI: 0000000000000000 RDI: ffff880078d05000
    [ 88.422315] RBP: ffff8800785dbb58 R08: ffff880078d043f8 R09: ffff880078d4a000
    [ 88.422968] R10: 0000000000010000 R11: 0000000000000002 R12: 0000000000b0a23a
    [ 88.423612] R13: ffff880078d05000 R14: ffff880078683100 R15: ffff88006668b980
    [ 88.424295] FS: 0000000000000000(0000) GS:ffff88007c600000(0000) knlGS:0000000000000000
    [ 88.424944] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 88.425597] CR2: 00007f40bc370f90 CR3: 0000000035af5000 CR4: 00000000001407f0
    [ 88.426285] Stack:
    [ 88.426921] ffff8800785dbaa8 ffffffffa049e4af ffff8800785dba08 ffffffff813298f0
    [ 88.427585] ffff880078683300 ffff8800769b0de8 0000089d00000001 0000000087f805e0
    [ 88.428228] ffff880000000000 ffff880079434a00 0000000000000000 ffff88006668b980
    [ 88.428877] Call Trace:
    [ 88.429527] [] ? exp_get_by_name+0x7f/0xb0 [nfsd]
    [ 88.430168] [] ? inode_doinit_with_dentry+0x210/0x6a0
    [ 88.430807] [] ? d_lookup+0x2e/0x60
    [ 88.431449] [] ? dput+0x33/0x230
    [ 88.432097] [] ? mntput+0x24/0x40
    [ 88.432719] [] ? path_put+0x22/0x30
    [ 88.433340] [] ? nfsd_cross_mnt+0xb7/0x1c0 [nfsd]
    [ 88.433954] [] nfsd4_encode_dirent+0x1b0/0x3d0 [nfsd]
    [ 88.434601] [] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
    [ 88.435172] [] nfsd_readdir+0x1c1/0x2a0 [nfsd]
    [ 88.435710] [] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
    [ 88.436447] [] nfsd4_encode_readdir+0x120/0x220 [nfsd]
    [ 88.437011] [] nfsd4_encode_operation+0x7d/0x190 [nfsd]
    [ 88.437566] [] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
    [ 88.438157] [] nfsd_dispatch+0xc3/0x220 [nfsd]
    [ 88.438680] [] svc_process_common+0x43b/0x690 [sunrpc]
    [ 88.439192] [] svc_process+0x103/0x1b0 [sunrpc]
    [ 88.439694] [] nfsd+0x117/0x190 [nfsd]
    [ 88.440194] [] ? nfsd_destroy+0x90/0x90 [nfsd]
    [ 88.440697] [] kthread+0xd8/0xf0
    [ 88.441260] [] ? kthread_worker_fn+0x180/0x180
    [ 88.441762] [] ret_from_fork+0x58/0x90
    [ 88.442322] [] ? kthread_worker_fn+0x180/0x180
    [ 88.442879] Code: 0f 84 93 05 00 00 83 f8 ea c7 85 a0 fe ff ff 00 00 27 30 0f 84 ba fe ff ff 85 c0 0f 85 a5 fe ff ff e9 e3 f9 ff ff 0f 1f 44 00 00 0b 66 0f 1f 44 00 00 be 04 00 00 00 4c 89 ef 4c 89 8d 68 fe
    [ 88.444052] RIP [] nfsd4_encode_fattr+0x820/0x1f00 [nfsd]
    [ 88.444658] RSP
    [ 88.445232] ---[ end trace 6cb9d0487d94a29f ]---

    Signed-off-by: Kinglong Mee
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
     
  • commit 32e5a2a2be6b085febaac36efff495ad65a55e6c upstream.

    When using a large volume, for example 9T volume with 2T already used,
    frequent creation of small files with O_DIRECT when the IO is not
    cluster aligned may clear sectors in the wrong place. This will cause
    filesystem corruption.

    This is because p_cpos is a u32. When calculating the corresponding
    sector it should be converted to u64 first, otherwise it may overflow.
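
    The arithmetic problem in isolation (field name from the description
    above, shift width illustrative):

    u32 p_cpos = 0x00200000;             /* cluster number on a huge volume */
    int clusters_to_sectors_shift = 11;  /* illustrative shift */

    /* 32-bit shift: high bits are silently lost */
    u64 wrong = p_cpos << clusters_to_sectors_shift;
    /* widen to u64 first, then shift */
    u64 right = (u64)p_cpos << clusters_to_sectors_shift;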

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joseph Qi
     
  • commit 209f7512d007980fd111a74a064d70a3656079cf upstream.

    The "BUG_ON(list_empty(&osb->blocked_lock_list))" in
    ocfs2_downconvert_thread_do_work can be triggered in the following case:

    ocfs2dc first saves osb->blocked_lock_count to the local variable
    'processed', and then processes the dentry lockres. During the dentry
    put, it calls iput and then deletes rw, inode and open lockres from
    the blocked list in ocfs2_mark_lockres_freeing. This causes the
    variable 'processed' to no longer reflect the number of blocked
    lockres to be processed, which triggers the BUG.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joseph Qi
     
  • commit 8f2f3eb59dff4ec538de55f2e0592fec85966aab upstream.

    fsnotify_clear_marks_by_group_flags() can race with
    fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked()
    drops mark_mutex, a mark from the list iterated by
    fsnotify_clear_marks_by_group_flags() can be freed and thus the next
    entry pointer we have cached may become stale and we dereference free
    memory.

    Fix the problem by first moving marks to free to a special private list
    and then always free the first entry in the special list. This method
    is safe even when entries from the list can disappear once we drop the
    lock.
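
    The resulting pattern looks roughly like this (simplified sketch of
    fsnotify_clear_marks_by_group_flags(); details abbreviated):

    LIST_HEAD(to_free);
    struct fsnotify_mark *mark, *lmark;

    /* Pass 1: move the marks we intend to destroy onto a private list. */
    mutex_lock(&group->mark_mutex);
    list_for_each_entry_safe(mark, lmark, &group->marks_list, g_list)
            if (mark->flags & flags)
                    list_move(&mark->g_list, &to_free);
    mutex_unlock(&group->mark_mutex);

    /* Pass 2: always take the first entry of the private list; entries
     * may vanish from it while the mutex is dropped, but that is safe. */
    while (1) {
            mutex_lock(&group->mark_mutex);
            if (list_empty(&to_free)) {
                    mutex_unlock(&group->mark_mutex);
                    break;
            }
            mark = list_first_entry(&to_free, struct fsnotify_mark, g_list);
            fsnotify_get_mark(mark);
            fsnotify_destroy_mark_locked(mark, group);
            mutex_unlock(&group->mark_mutex);
            fsnotify_put_mark(mark);
    }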

    Signed-off-by: Jan Kara
    Reported-by: Ashish Sangwan
    Reviewed-by: Ashish Sangwan
    Cc: Lino Sanfilippo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

11 Aug, 2015

2 commits

  • commit df150ed102baa0e78c06e08e975dfb47147dd677 upstream.

    We don't log remote attribute contents, and instead write them
    synchronously before we commit the block allocation and attribute
    tree update transaction. As a result we are writing to the allocated
    space before the allocation has been made permanent.

    As a result, we cannot consider this allocation to be a metadata
    allocation. Metadata allocation can take blocks from the free list
    and so reuse them before the transaction that freed the block is
    committed to disk. This behaviour is perfectly fine for journalled
    metadata changes as log recovery will ensure the free operation is
    replayed before the overwrite, but for remote attribute writes this
    is not the case.

    Hence we have to consider the remote attribute blocks to contain
    data and allocate accordingly. We do this by dropping the
    XFS_BMAPI_METADATA flag from the block allocation. This means the
    allocation will not use blocks that are on the busy list without
    first ensuring that the freeing transaction has been committed to
    disk and the blocks removed from the busy list. This ensures we will
    never overwrite a freed block without first ensuring that it is
    really free.
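
    Concretely, the allocation stops passing XFS_BMAPI_METADATA, along
    these lines (abbreviated sketch of the xfs_bmapi_write() call in the
    remote attribute set path):

    /* before: XFS_BMAPI_ATTRFORK | XFS_BMAPI_METADATA */
    error = xfs_bmapi_write(args->trans, dp, (xfs_fileoff_t)lblkno, blkcnt,
                            XFS_BMAPI_ATTRFORK,  /* data allocation rules */
                            args->firstblock, args->total, &map, &nmap,
                            args->flist);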

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit e3c32ee9e3e747fec01eb38e6610a9157d44c3ea upstream.

    In recent testing, a system that crashed failed log recovery on
    restart with a bad symlink buffer magic number:

    XFS (vda): Starting recovery (logdev: internal)
    XFS (vda): Bad symlink block magic!
    XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060

    On examination of the log via xfs_logprint, none of the symlink
    buffers in the log had a bad magic number, nor were any other types
    of buffer log format headers mis-identified as symlink buffers.
    Tracing was used to find the buffer the kernel was tripping over,
    and xfs_db identified its contents as:

    000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
    020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
    040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
    060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
    .....

    This is a remote attribute buffer, which are notable in that they
    are not logged but are instead written synchronously by the remote
    attribute code so that they exist on disk before the attribute
    transactions are committed to the journal.

    The above remote attribute block has an invalid LSN in it - cycle
    0xd0020000, block 0 - which means when log recovery comes along to
    determine if the transaction that writes to the underlying block
    should be replayed, it sees a block that has a future LSN and so
    does not replay the buffer data in the transaction. Instead, it
    validates the buffer magic number and attaches the buffer verifier
    to it. It is this buffer magic number check that is failing in the
    above assert, indicating that we skipped replay due to the LSN of
    the underlying buffer.

    The problem here is that the remote attribute buffers cannot have a
    valid LSN placed into them, because the transaction that contains
    the attribute tree pointer changes and the block allocation that the
    attribute data is being written to hasn't yet been committed. Hence
    the LSN field in the attribute block is completely unwritten,
    thereby leaving the underlying contents of the block in the LSN
    field. It could have any value, and hence a future overwrite of the
    block by log recovery may or may not work correctly.

    Fix this by always writing an invalid LSN to the remote attribute
    block, so that any buffer in log recovery that needs to write over the
    remote attribute block always does so. We are protected from having
    old data
    written over the attribute by the fact that freeing the block before
    the remote attribute is written will result in the buffer being
    marked stale in the log and so all changes prior to the buffer stale
    transaction will be cancelled by log recovery.

    Hence it is safe to ignore the LSN in the case of synchronously
    written, unlogged metadata such as remote attribute blocks, and to
    ensure we do that correctly, we need to write an invalid LSN to all
    remote attribute blocks to trigger immediate recovery of metadata
    that is written over the top.

    As a further protection for filesystems that may already have remote
    attribute blocks with bad LSNs on disk, change the log recovery code
    to always trigger immediate recovery of metadata over remote
    attribute blocks.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner