Eric Lee / smarc-fsl-linux-kernel

29 May, 2016

10 commits

e0ab7af9b hash_string: Fix zero-length case for !DCACHE_WORD_ACCESS

The self-test was updated to cover zero-length strings; the function
needs to be updated, too.

Reported-by: Geert Uytterhoeven
Signed-off-by: George Spelvin
Fixes: fcfd2fbf22d2 ("fs/namei.c: Add hashlen_string() function")
Signed-off-by: Linus Torvalds

George Spelvin
2016-05-29 22:33:47 +0800
f2a031b66 Rename other copy of hash_string to hashlen_string

The original name was simply hash_string(), but that conflicted with a
function with that name in drivers/base/power/trace.c, and I decided
that calling it "hashlen_" was better anyway.

But you have to do it in two places.

[ This caused build errors for architectures that don't define
CONFIG_DCACHE_WORD_ACCESS - Linus ]

Signed-off-by: George Spelvin
Reported-by: Guenter Roeck
Fixes: fcfd2fbf22d2 ("fs/namei.c: Add hashlen_string() function")
Signed-off-by: Linus Torvalds

George Spelvin
2016-05-29 13:34:33 +0800
037369b87 hpfs: implement the show_options method

The HPFS filesystem used generic_show_options to produce the string that
is displayed in /proc/mounts. However, options may disappear after a
remount. If we mount the filesystem with option1 and then remount it
with option2, /proc/mounts should show both option1 and option2, but it
shows only option2 because the whole option string is replaced with
replace_mount_options in hpfs_remount_fs.

To fix this bug, implement the hpfs_show_options function that prints
options that are currently selected.

Signed-off-by: Mikulas Patocka
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Mikulas Patocka
2016-05-29 07:50:24 +0800
01d6e0871 affs: fix remount failure when there are no options changed

Commit c8f33d0bec99 ("affs: kstrdup() memory handling") checks if the
kstrdup function returns NULL due to out-of-memory condition.

However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL. In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists. The mount syscall then fails with
ENOMEM.

This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.

The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).
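
The logic described above can be sketched in userspace C. This is only an illustration under stated assumptions: dup_options() stands in for kstrdup(), and remount_sketch() and its "reported" string are hypothetical names, not the code in the patch.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for kstrdup(): returns NULL both when the input
 * is NULL and when the allocation fails. */
static char *dup_options(const char *data)
{
	if (!data)
		return NULL;
	char *copy = malloc(strlen(data) + 1);
	if (copy)
		strcpy(copy, data);
	return copy;
}

/* Hypothetical remount helper. *reported models the option string shown
 * in /proc/mounts. Returns 0 on success or -ENOMEM. */
static int remount_sketch(char **reported, const char *data)
{
	char *new_opts = dup_options(data);

	/* A NULL copy is an out-of-memory error only if data was non-NULL. */
	if (data && !new_opts)
		return -ENOMEM;

	/* No filesystem-specific options were passed: keep the old report. */
	if (new_opts) {
		free(*reported);
		*reported = new_opts;
	}
	return 0;
}
```

With this shape, a remount that passes no options returns 0 and leaves the previously reported options in place.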

Fixes: c8f33d0bec99 ("affs: kstrdup() memory handling")
Signed-off-by: Mikulas Patocka
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Linus Torvalds

Mikulas Patocka
2016-05-29 07:50:24 +0800
44d51706b hpfs: fix remount failure when there are no options changed

Commit ce657611baf9 ("hpfs: kstrdup() out of memory handling") checks if
the kstrdup function returns NULL due to out-of-memory condition.

However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL. In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists. The mount syscall then fails with
ENOMEM.

This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.

The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).

Fixes: ce657611baf9 ("hpfs: kstrdup() out of memory handling")
Signed-off-by: Mikulas Patocka
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Mikulas Patocka
2016-05-29 07:50:24 +0800
d66492bce fs: fix binfmt_aout.c build error

Various builds (such as i386:allmodconfig) fail with

fs/binfmt_aout.c:133:2: error: expected identifier or '(' before 'return'
fs/binfmt_aout.c:134:1: error: expected identifier or '(' before '}' token

[ Oops. My bad, I had stupidly thought that "allmodconfig" covered this
on x86-64 too, but it obviously doesn't. Egg on my face. - Linus ]

Fixes: 5d22fc25d4fc ("mm: remove more IS_ERR_VALUE abuses")
Signed-off-by: Guenter Roeck
Signed-off-by: Linus Torvalds

Guenter Roeck
2016-05-29 07:34:59 +0800
7e0fb73c5 Merge branch 'hash' of git://ftp.sciencehorizons.net/linux

Pull string hash improvements from George Spelvin:
"This series does several related things:

- Makes the dcache hash (fs/namei.c) useful for general kernel use.

(Thanks to Bruce for noticing the zero-length corner case)

- Converts the string hashes in <linux/sunrpc/svcauth.h> to use the
above.

- Avoids 64-bit multiplies in hash_64() on 32-bit platforms. Two
32-bit multiplies will do well enough.

- Rids the world of the bad hash multipliers in hash_32.

This finishes the job started in commit 689de1d6ca95 ("Minimal
fix-up of bad hashing behavior of hash_64()")

The vast majority of Linux architectures have hardware support for
32x32-bit multiply and so derive no benefit from "simplified"
multipliers.

The few processors that do not (68000, h8/300 and some models of
Microblaze) have arch-specific implementations added. Those
patches are last in the series.

- Overhauls the dcache hash mixing.

The patch in commit 0fed3ac866ea ("namei: Improve hash mixing if
CONFIG_DCACHE_WORD_ACCESS") was an off-the-cuff suggestion.
Replaced with a much more careful design that's simultaneously
faster and better. (My own invention, as there was nothing suitable
in the literature I could find. Comments welcome!)

- Modify the hash_name() loop to skip the initial HASH_MIX(). This
would let us salt the hash if we ever wanted to.

- Sort out partial_name_hash().

The hash function is declared as using a long state, even though
it's truncated to 32 bits at the end and the extra internal state
contributes nothing to the result. And some callers do odd things:

- fs/hfs/string.c only allocates 32 bits of state
- fs/hfsplus/unicode.c uses it to hash 16-bit unicode symbols not bytes

- Modify bytemask_from_count to handle inputs of 1..sizeof(long)
rather than 0..sizeof(long)-1. This would simplify users other
than full_name_hash"

Special thanks to Bruce Fields for testing and finding bugs in v1. (I
learned some humbling lessons about "obviously correct" code.)

On the arch-specific front, the m68k assembly has been tested in a
standalone test harness, I've been in contact with the Microblaze
maintainers who mostly don't care, as the hardware multiplier is never
omitted in real-world applications, and I haven't heard anything from
the H8/300 world"

* 'hash' of git://ftp.sciencehorizons.net/linux:
h8300: Add <asm/hash.h>
microblaze: Add <asm/hash.h>
m68k: Add <asm/hash.h>
<linux/hash.h>: Add support for architecture-specific functions
fs/namei.c: Improve dcache hash function
Eliminate bad hash multipliers from hash_32() and hash_64()
Change hash_64() return value to 32 bits
<linux/sunrpc/svcauth.h>: Define hash_str() in terms of hashlen_string()
fs/namei.c: Add hashlen_string() function
Pull out string hash to <linux/stringhash.h>

Linus Torvalds
2016-05-29 07:15:25 +0800
468a94285 <linux/hash.h>: Add support for architecture-specific functions

This is just the infrastructure; there are no users yet.

This is modelled on CONFIG_ARCH_RANDOM; a CONFIG_ symbol declares
the existence of <asm/hash.h>.

That file may define its own versions of various functions, and define
HAVE_* symbols (no CONFIG_ prefix!) to suppress the generic ones.

Included is a self-test (in lib/test_hash.c) that verifies the basics.
It is NOT in general required that the arch-specific functions compute
the same thing as the generic, but if a HAVE_* symbol is defined with
the value 1, then equality is tested.
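
A minimal userspace sketch of the override pattern described above. All names here (hash32, hash32_generic, HAVE_ARCH_HASH32, hash32_selftest) and the multiplier are assumptions made for illustration; they are not the identifiers used by the patch. The "arch" version computes the same product from two 16-bit partial products, so it may legitimately claim equality by defining its symbol to 1.

```c
#include <stdint.h>

/* Generic version, always available under a distinct name. */
static inline uint32_t hash32_generic(uint32_t val)
{
	return val * 0x61C88647u;	/* illustrative odd multiplier */
}

/* --- what a hypothetical arch header might contain --- */
#define HAVE_ARCH_HASH32 1	/* 1 = promises the same result as generic */
static inline uint32_t hash32(uint32_t val)
{
	/* Same product, built from two 16-bit partial products. */
	uint32_t lo = val * 0x8647u;
	uint32_t hi = val * 0x61C8u;
	return lo + (hi << 16);
}
/* --- end of hypothetical arch header --- */

/* Generic header: fall back only if the arch did not supply a version. */
#ifndef HAVE_ARCH_HASH32
#define hash32 hash32_generic
#endif

/* Self-test: equality is checked only when the symbol's value is 1. */
static int hash32_selftest(uint32_t val)
{
#if HAVE_ARCH_HASH32 == 1
	return hash32(val) == hash32_generic(val);
#else
	(void)val;
	return 1;
#endif
}
```

An arch version that is allowed to differ from the generic one would define the symbol to some value other than 1, and the self-test would then skip the comparison.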

Signed-off-by: George Spelvin
Cc: Geert Uytterhoeven
Cc: Greg Ungerer
Cc: Andreas Schwab
Cc: Philippe De Muyter
Cc: linux-m68k@lists.linux-m68k.org
Cc: Alistair Francis
Cc: Michal Simek
Cc: Yoshinori Sato
Cc: uclinux-h8-devel@lists.sourceforge.jp

George Spelvin
2016-05-29 03:48:31 +0800
2a18da7a9 fs/namei.c: Improve dcache hash function

Patch 0fed3ac866 improved the hash mixing, but the function is slower
than necessary; there's a 7-instruction dependency chain (10 on x86)
each loop iteration.

Word-at-a-time access is a very tight loop (which is good, because
link_path_walk() is one of the hottest code paths in the entire kernel),
and the hash mixing function must not have a longer latency to avoid
slowing it down.

There do not appear to be any published fast hash functions that:
1) Operate on the input a word at a time, and
2) Don't need to know the length of the input beforehand, and
3) Have a single iterated mixing function, not needing conditional
branches or unrolling to distinguish different loop iterations.

One of the algorithms which comes closest is Yann Collet's xxHash, but
that's two dependent multiplies per word, which is too much.

The key insights in this design are:

1) Barring expensive ops like multiplies, to diffuse one input bit
across 64 bits of hash state takes at least log2(64) = 6 sequentially
dependent instructions. That is more cycles than we'd like.
2) An operation like "hash ^= hash << 13" requires a second temporary
register anyway, and on a 2-operand machine like x86, it's three
instructions.
3) A better use of a second register is to hold a two-word hash state.
With careful design, no temporaries are needed at all, so it doesn't
increase register pressure. And this gets rid of register copying
on 2-operand machines, so the code is smaller and faster.
4) Using two words of state weakens the requirement for one-round mixing;
we now have two rounds of mixing before cancellation is possible.
5) A two-word hash state also allows operations on both halves to be
done in parallel, so on a superscalar processor we get more mixing
in fewer cycles.

I ended up using a mixing function inspired by the ChaCha and Speck
round functions. It is 6 simple instructions and 3 cycles per iteration
(assuming multiply by 9 can be done by an "lea" instruction):

x ^= *input++;
y ^= x; x = ROL(x, K1);
x += y; y = ROL(y, K2);
y *= 9;

Not only is this reversible, two consecutive rounds are reversible:
if you are given the initial and final states, but not the intermediate
state, it is possible to compute both input words. This means that at
least 3 words of input are required to create a collision.

(It also has the property, used by hash_name() to avoid a branch, that
it hashes all-zero to all-zero.)
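
The round above and its single-round inverse can be sketched as follows. This is an illustration only: the message does not state K1 and K2, so the rotate amounts 12 and 45 used here are placeholders, and mix_round(), unmix_round() and struct mix_state are hypothetical names. INV9 is the multiplicative inverse of 9 modulo 2^64, which is what makes the "y *= 9" step reversible.

```c
#include <stdint.h>

/* Placeholder rotate amounts standing in for K1 and K2 (assumed). */
#define K1 12
#define K2 45
/* 9 * INV9 == 1 (mod 2^64), so multiplying by INV9 undoes "y *= 9". */
#define INV9 0x8E38E38E38E38E39ULL

struct mix_state { uint64_t x, y; };

/* Rotates; n must be in 1..63, which holds for K1 and K2. */
static inline uint64_t rol64(uint64_t v, unsigned n)
{
	return (v << n) | (v >> (64 - n));
}

static inline uint64_t ror64(uint64_t v, unsigned n)
{
	return (v >> n) | (v << (64 - n));
}

/* One mixing round, following the four lines quoted in the message. */
static struct mix_state mix_round(struct mix_state s, uint64_t input)
{
	s.x ^= input;
	s.y ^= s.x;	s.x = rol64(s.x, K1);
	s.x += s.y;	s.y = rol64(s.y, K2);
	s.y *= 9;
	return s;
}

/* Inverse of one round: undo each step in reverse order. */
static struct mix_state unmix_round(struct mix_state s, uint64_t input)
{
	s.y *= INV9;			/* undo y *= 9 */
	s.y = ror64(s.y, K2);		/* undo y = ROL(y, K2) */
	s.x -= s.y;			/* undo x += y */
	s.x = ror64(s.x, K1);		/* undo x = ROL(x, K1) */
	s.y ^= s.x;			/* undo y ^= x */
	s.x ^= input;			/* undo x ^= input */
	return s;
}
```

Feeding an all-zero input into an all-zero state leaves the state all-zero, since every step maps zero to zero.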

The rotate constants K1 and K2 were found by experiment. The search took
a sample of random initial states (I used 1023) and considered the effect
of flipping each of the 64 input bits on each of the 128 output bits two
rounds later. Each of the 8192 pairs can be considered a biased coin, and
adding up the Shannon entropy of all of them produces a score.

The best-scoring shifts also did well in other tests (flipping bits in y,
trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
so the choice was made with the additional constraint that the sum of the
shifts is odd and not too close to the word size.

The final state is then folded into a 32-bit hash value by a less carefully
optimized multiply-based scheme. This also has to be fast, as pathname
components tend to be short (the most common case is one iteration!), but
there's some room for latency, as there is a fair bit of intervening logic
before the hash value is used for anything.

(Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
a better benchmark; the numbers seem to show a slight dip in performance
between 4.6.0 and this patch, but they're too noisy to quote.)

Special thanks to Bruce Fields for diligent testing which uncovered a
nasty fencepost error in an earlier version of this patch.

[checkpatch.pl formatting complaints noted and respectfully disagreed with.]

Signed-off-by: George Spelvin
Tested-by: J. Bruce Fields

George Spelvin
2016-05-29 03:45:29 +0800
fcfd2fbf2 fs/namei.c: Add hashlen_string() function

We'd like to make more use of the highly-optimized dcache hash functions
throughout the kernel, rather than have every subsystem create its own,
and a function that hashes basic null-terminated strings is required
for that.

(The name is to emphasize that it returns both hash and length.)

It's actually useful in the dcache itself, specifically d_alloc_name().
Other uses in the next patch.

full_name_hash() is also tweaked to make it more generally useful:
1) Take a "char *" rather than "unsigned char *" argument, to
be consistent with hash_name().
2) Handle zero-length inputs. If we want more callers, we don't want
to make them worry about corner cases.

Signed-off-by: George Spelvin

George Spelvin
2016-05-29 03:42:50 +0800

28 May, 2016

17 commits

23a3e178b Merge tag 'upstream-4.7-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:
"This contains mostly cleanups and minor improvements of UBI and UBIFS"

* tag 'upstream-4.7-rc1' of git://git.infradead.org/linux-ubifs:
ubifs: ubifs_dump_inode: Fix dumping field bulk_read
UBI: Fix static volume checks when Fastmap is used
UBI: Set free_count to zero before walking through erase list
UBI: Silence an unintialized variable warning
UBI: Clean up return in ubi_remove_volume()
UBI: Modify wrong comment in ubi_leb_map function.
UBI: Don't read back all data in ubi_eba_copy_leb()
UBI: Add ro-mode sysfs attribute

Linus Torvalds
2016-05-28 09:49:29 +0800
e0714ec4f nfs: fix anonymous member initializer build failure with older compilers

Older versions of gcc don't understand named initializers inside an
anonymous structure or union member. It can be worked around by adding
the bracing in the initializer for the anonymous member.

Without this, gcc 4.4.4 will fail the build with

CC fs/nfs/nfs4state.o
fs/nfs/nfs4state.c:69: error: unknown field ‘data’ specified in initializer
fs/nfs/nfs4state.c:69: warning: missing braces around initializer
fs/nfs/nfs4state.c:69: warning: (near initialization for ‘zero_stateid.<anonymous>.data’)
make[2]: *** [fs/nfs/nfs4state.o] Error 1

introduced in commit 93b717fd81bf ("NFSv4: Label stateids with the type")
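
The workaround can be illustrated with a small stand-alone type. The struct below is hypothetical and only loosely modelled on a stateid; its field names and sizes are assumptions for the sketch. The point is the extra brace level around the anonymous union's initializer.

```c
/* Hypothetical type with an anonymous union as its first member. */
struct stateid_sketch {
	union {
		char data[16];
		struct {
			unsigned int seqid;
			char other[12];
		} parts;
	};
	int type;
};

/*
 * Form that older compilers may reject, because the designator names a
 * member of the anonymous union directly:
 *
 *	{ .data = { 0 }, .type = 1 }
 *
 * Workaround: give the anonymous member its own pair of braces.
 */
static const struct stateid_sketch zero_stateid_sketch = {
	{ .data = { 0 } },
	.type = 1,
};
```

Newer compilers accept both forms; the braced form is the one that stays portable to the older ones.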

Reported-and-tested-by: Boris Ostrovsky
Cc: Anna Schumaker
Cc: Trond Myklebust
Signed-off-by: Linus Torvalds

Linus Torvalds
2016-05-28 08:20:27 +0800
d102a56ed Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
"Followups to the parallel lookup work:

- update docs

- restore killability of the places that used to take ->i_mutex
killably now that we have down_write_killable() merged

- Additionally, it turns out that I missed a prerequisite for
security_d_instantiate() stuff - ->getxattr() wasn't the only thing
that could be called before dentry is attached to inode; with smack
we needed the same treatment applied to ->setxattr() as well"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->setxattr() to passing dentry and inode separately
switch xattr_handler->set() to passing dentry and inode separately
restore killability of old mutex_lock_killable(&inode->i_mutex) users
add down_write_killable_nested()
update D/f/directory-locking

Linus Torvalds
2016-05-28 08:14:05 +0800
3767e255b switch ->setxattr() to passing dentry and inode separately

smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
we'd hashed the new dentry and attached it to inode, we need ->setxattr()
instances getting the inode as an explicit argument rather than obtaining
it from dentry.

Similar change for ->getxattr() had been done in commit ce23e64. Unlike
->getxattr() (which is used by both selinux and smack instances of
->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
it got missed back then.

Reported-by: Seung-Woo Kim
Tested-by: Casey Schaufler
Signed-off-by: Al Viro

Al Viro
2016-05-28 08:09:16 +0800
0121a3220 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:
"The meat of this is a change to use the mounter's credentials for
operations that require elevated privileges (such as whiteout
creation). This fixes behavior under user namespaces as well as being
a nice cleanup"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: Do d_type check only if work dir creation was successful
ovl: update documentation
ovl: override creds with the ones from the superblock mounter

Linus Torvalds
2016-05-28 07:44:39 +0800
559b6d90a Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs

Pull btrfs cleanups and fixes from Chris Mason:
"We have another round of fixes and a few cleanups.

I have a fix for short returns from btrfs_copy_from_user, which
finally nails down a very hard to find regression we added in v4.6.

Dave is pushing around gfp parameters, mostly to cleanup internal apis
and make it a little more consistent.

The rest are smaller fixes, and one speelling fixup patch"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
Btrfs: fix handling of faults from btrfs_copy_from_user
btrfs: fix string and comment grammatical issues and typos
btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
Btrfs: fix unexpected return value of fiemap
Btrfs: free sys_array eb as soon as possible
btrfs: sink gfp parameter to convert_extent_bit
btrfs: make state preallocation more speculative in __set_extent_bit
btrfs: untangle gotos a bit in convert_extent_bit
btrfs: untangle gotos a bit in __clear_extent_bit
btrfs: untangle gotos a bit in __set_extent_bit
btrfs: sink gfp parameter to set_record_extent_bits
btrfs: sink gfp parameter to set_extent_new
btrfs: sink gfp parameter to set_extent_defrag
btrfs: sink gfp parameter to set_extent_delalloc
btrfs: sink gfp parameter to clear_extent_dirty
btrfs: sink gfp parameter to clear_record_extent_bits
btrfs: sink gfp parameter to clear_extent_bits
btrfs: sink gfp parameter to set_extent_bits
btrfs: make find_workspace warn if there are no workspaces
btrfs: make find_workspace always succeed
...

Linus Torvalds
2016-05-28 07:37:36 +0800
5d22fc25d mm: remove more IS_ERR_VALUE abuses

The do_brk() and vm_brk() return value was "unsigned long" and returned
the starting address on success, and an error value on failure. The
reasons are entirely historical, and go back to it basically behaving
like the mmap() interface does.

However, nobody actually wanted that interface, and it causes totally
pointless IS_ERR_VALUE() confusion.

What every single caller actually wants is just the simpler integer
return of zero for success and negative error number on failure.

So just convert to that much clearer and more common calling convention,
and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().

Signed-off-by: Linus Torvalds

Linus Torvalds
2016-05-28 06:57:31 +0800
287980e49 remove lots of IS_ERR_VALUE abuses

Most users of IS_ERR_VALUE() in the kernel are wrong, as they
pass an 'int' into a function that takes an 'unsigned long'
argument. This happens to work because the type is sign-extended
on 64-bit architectures before it gets converted into an
unsigned type.

However, anything that passes an 'unsigned short' or 'unsigned int'
argument into IS_ERR_VALUE() is guaranteed to be broken, as are
8-bit integers and types that are wider than 'unsigned long'.
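
A small sketch of why the argument type matters. It models the check as it behaves on a 64-bit machine by using uint64_t in place of 'unsigned long', so the result is the same on any host; IS_ERR_VALUE64 and the check_*() helpers are hypothetical names, and 'int' is assumed to be 32 bits wide.

```c
#include <stdint.h>

#define MAX_ERRNO 4095

/* The classic range check, with 'unsigned long' modelled as 64 bits. */
#define IS_ERR_VALUE64(x) ((uint64_t)(x) >= (uint64_t)(-MAX_ERRNO))

/* A signed int is sign-extended first, so -ENOMEM (-12) lands in the
 * error range and is detected. */
static int check_int(int v)
{
	return IS_ERR_VALUE64(v);
}

/* An unsigned int is zero-extended: the same bit pattern becomes
 * 0x00000000fffffff4, far below the error range. */
static int check_uint(unsigned int v)
{
	return IS_ERR_VALUE64(v);
}

/* Narrower unsigned types fail the same way. */
static int check_ushort(unsigned short v)
{
	return IS_ERR_VALUE64(v);
}
```

Here check_int(-12) reports an error, while check_uint() and check_ushort() miss it when handed the same value converted to their own type.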

Andrzej Hajda has already fixed a lot of the worst abusers that
were causing actual bugs, but it would be nice to prevent any
users that are not passing 'unsigned long' arguments.

This patch changes all users of IS_ERR_VALUE() that I could find
on 32-bit ARM randconfig builds and x86 allmodconfig. For the
moment, this doesn't change the definition of IS_ERR_VALUE()
because there are probably still architecture specific users
elsewhere.

Almost all the warnings I got are for files that are better off
using 'if (err)' or 'if (err < 0)'.
The only legitimate user I could find that we get a warning for
is the (32-bit only) freescale fman driver, so I did not remove
the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
For 9pfs, I just worked around one user whose calling conventions
are so obscure that I did not dare change the behavior.

I was using this definition for testing:

#define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))

which ends up making all 16-bit or wider types work correctly with
the most plausible interpretation of what IS_ERR_VALUE() was supposed
to return according to its users, but also causes a compile-time
warning for any users that do not pass an 'unsigned long' argument.

I suggested this approach earlier this year, but back then we ended
up deciding to just fix the users that are obviously broken. After
the initial warning that caused me to get involved in the discussion
(fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
asked me to send the whole thing again.

[ Updated the 9p parts as per Al Viro - Linus ]

Signed-off-by: Arnd Bergmann
Cc: Andrzej Hajda
Cc: Andrew Morton
Link: https://lkml.org/lkml/2016/1/7/363
Link: https://lkml.org/lkml/2016/5/27/486
Acked-by: Srinivas Kandagatla # For nvmem part
Signed-off-by: Linus Torvalds

Arnd Bergmann
2016-05-28 06:26:11 +0800
9ecd10b7a direct-io: fix direct write stale data exposure from concurrent buffered read

Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
not allowed to allocate blocks (get_more_blocks() sets 'create' to 0
before calling the get_block() callback). If it's a sparse file, direct
writes fall back to buffered writes to avoid stale data exposure from
a concurrent buffered read. But two cases that can result in stale data
exposure are not correctly detected.

1. The detection for "writing inside i_size" is not sufficient;
writes can wrongly be treated as "extending writes". For example,
direct write 1FSB (file system block) to a 1FSB sparse file on
ext2/3/4, starting from offset 0. In this case it's writing inside
i_size, but 'create' is non-zero, because 'block_in_file' and
'i_size_read(inode) >> blkbits' are both zero.

2. Direct writes starting from or beyond i_size (not inside i_size)
also could trigger block allocation and expose stale data. For
example, consider a sparse file with i_size of 2k, and a write to
offset 2k or 3k into the file, with a filesystem block size of 4k.
(Thanks to Jeff Moyer for pointing this case out in his review.)

The first problem can be demonstrated by running ltp-aiodio test ADSP045
many times. When testing on extN filesystems, I occasionally see test
failures where a buffered read returns non-zero (stale) data.

ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1

dio_sparse 0 TINFO : Dirtying free blocks
dio_sparse 0 TINFO : Starting I/O tests
non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
non-zero read at offset 0
dio_sparse 0 TINFO : Killing childrens(s)
dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally

The second problem can also be reproduced easily by a hacked dio_sparse
program, which accepts an option to specify the write offset.

What we should really do is to disable block allocation for writes that
could result in filling holes inside i_size.
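
The difference between the old check and a stricter one can be sketched with plain integers. This is an assumed formulation for illustration, not necessarily the exact condition in the patch; create_old() and create_new() are hypothetical helpers that return the value 'create' would take for a write starting at a given filesystem block.

```c
#include <stdint.h>

/* Old test: the starting block is compared with i_size rounded DOWN to
 * whole blocks, so a partial last block is not counted as inside i_size. */
static int create_old(uint64_t start_block, uint64_t i_size, unsigned blkbits)
{
	if (start_block < (i_size >> blkbits))
		return 0;
	return 1;
}

/* Stricter test (assumed): never allocate at or before the block that
 * holds the last byte of the file. An empty file has no such block. */
static int create_new(uint64_t start_block, uint64_t i_size, unsigned blkbits)
{
	if (i_size && start_block <= ((i_size - 1) >> blkbits))
		return 0;
	return 1;
}
```

For the ADSP045 case (i_size of 2k, 4k blocks, write at block 0), create_old() leaves allocation enabled while create_new() disables it. The same holds for writes at offset 2k or 3k, which also start in block 0.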

Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
Reviewed-by: Jan Kara
Signed-off-by: Eryu Guan
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eryu Guan
2016-05-28 05:49:37 +0800
38b52efd2 ocfs2: bump up o2cb network protocol version

Two new messages are added to support negotiating hb timeout. Stop
nodes that talk an old protocol version from mounting, as they would
cause the negotiation to fail.

Link: http://lkml.kernel.org/r/1464231615-27939-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-05-28 05:49:37 +0800
6633ca573 ocfs2: o2hb: fix hb hung time

hr_last_timeout_start should be set to the last time at which hb was
still OK. When an hb write times out, the hung time will be (jiffies -
hr_last_timeout_start).

Signed-off-by: Junxiao Bi
Reviewed-by: Ryan Ding
Reviewed-by: Mark Fasheh
Cc: Gang He
Cc: rwxybh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-05-28 05:49:37 +0800
88dbe98dc ocfs2: o2hb: don't negotiate if last hb fail

Sometimes an io error is returned when storage is down for a while. For
an iscsi device, for example, storage is taken offline when the session
times out, and this makes all io return -EIO. In this case, nodes
shouldn't negotiate the timeout but should fence themselves. So let
nodes fence themselves when o2hb_do_disk_heartbeat returns an error;
this is the same behavior as o2hb without the negotiate timer.

Signed-off-by: Junxiao Bi
Reviewed-by: Ryan Ding
Reviewed-by: Mark Fasheh
Cc: Gang He
Cc: rwxybh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-05-28 05:49:37 +0800
1bd129028 ocfs2: o2hb: add some user/debug log

Signed-off-by: Junxiao Bi
Reviewed-by: Ryan Ding
Reviewed-by: Mark Fasheh
Cc: Gang He
Cc: rwxybh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-05-28 05:49:37 +0800
e76f8237a ocfs2: o2hb: add NEGOTIATE_APPROVE message

This message is used to re-queue the write timeout timer and the
negotiate timer when all nodes suffer a write hang to storage, so that
nodes do not fence themselves if storage is down.

Signed-off-by: Junxiao Bi
Reviewed-by: Ryan Ding
Reviewed-by: Mark Fasheh
Cc: Gang He
Cc: rwxybh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-05-28 05:49:37 +0800
34069b886 ocfs2: o2hb: add NEGO_TIMEOUT message

This message is sent to the master node when a non-master node's
negotiate timer expires. The master node records these nodes in a bitmap
which is used to decide whether to re-queue the write timeout timer.

Signed-off-by: Junxiao Bi
Reviewed-by: Ryan Ding
Reviewed-by: Mark Fasheh
Cc: Gang He
Cc: rwxybh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-05-28 05:49:37 +0800
e0cbb7980 ocfs2: o2hb: add negotiate timer

This series of patches fixes the issue that when storage is down, all
nodes fence themselves due to write timeout.

With this patch set, all nodes will keep going until storage is back
online, unless one of the following happens, in which case all nodes
will fence themselves as before:

1. an io error is returned
2. the network between nodes is down
3. a node panics

This patch (of 6):

When storage is down, all nodes will fence themselves due to write
timeout. The negotiate timer is designed to avoid this; with it, a node
will wait until storage is up again.

The negotiate timer works in the following way:

1. The timer expires before the write timeout timer; its timeout is
currently half of the write timeout. It is re-queued along with the
write timeout timer. If it expires, it sends a NEGO_TIMEOUT message to
the master node (the node with the lowest node number). This message
does nothing but mark a bit in a bitmap on the master node recording
which nodes are negotiating timeout.

2. If storage is down, nodes will send this message to the master node.
When the master node finds that its bitmap includes all online nodes, it
sends a NEGO_APPROVE message to all nodes one by one; this message
re-queues the write timeout timer and the negotiate timer. Any node that
doesn't receive this message, or hits an issue when handling it, will be
fenced. If storage comes up at any time, o2hb_thread will run and
re-queue all the timers, so nothing is affected by these two steps.
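
The master node's bookkeeping in the two steps above can be sketched with plain bitmaps. This is an illustration under assumptions: struct nego_master and both helpers are hypothetical names, node numbers are assumed to be below 64, and message transport and timers are left out entirely.

```c
#include <stdint.h>

/* Hypothetical master-side state: one bit per node number (0..63). */
struct nego_master {
	uint64_t online;	/* nodes currently heartbeating */
	uint64_t negotiating;	/* nodes that reported a negotiate timeout */
};

/* Step 1: a timeout message from 'node' only marks a bit in the bitmap. */
static void nego_timeout_received(struct nego_master *m, unsigned node)
{
	m->negotiating |= (uint64_t)1 << node;
}

/* Step 2: approval is sent only once the bitmap covers every online
 * node, i.e. all of them are stuck on the same storage outage. */
static int nego_should_approve(const struct nego_master *m)
{
	return m->online != 0 &&
	       (m->negotiating & m->online) == m->online;
}
```

With three online nodes, the master approves only after the third timeout message arrives; a node that can still write never sets its bit, so no approval is sent and the others time out as before.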

Signed-off-by: Junxiao Bi
Reviewed-by: Ryan Ding
Reviewed-by: Mark Fasheh
Cc: Gang He
Cc: rwxybh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-05-28 05:49:37 +0800
593012268 switch xattr_handler->set() to passing dentry and inode separately

preparation for similar switch in ->setxattr() (see the next commit for
rationale).

Signed-off-by: Al Viro

Al Viro
2016-05-28 03:39:43 +0800

27 May, 2016

10 commits

21765194c ovl: Do d_type check only if work dir creation was successful

The d_type check requires successful creation of workdir, as it iterates
through the work dir and expects the work dir to be present in it. If
that's not the case, the check will always return d_type not supported
even if the underlying filesystem might support it.

So don't do this check if work dir creation failed in previous step.

Signed-off-by: Vivek Goyal
Signed-off-by: Miklos Szeredi

Vivek Goyal
2016-05-27 16:18:56 +0800
3fe6e52f0 ovl: override creds with the ones from the superblock mounter

In user namespace the whiteout creation fails with -EPERM because the
current process isn't capable(CAP_SYS_ADMIN) when setting xattr.

A simple reproducer:

$ mkdir upper lower work merged lower/dir
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
$ unshare -m -p -f -U -r bash

Now as root in the user namespace:

# touch merged/dir/{1,2,3} # this will force a copy up of lower/dir
# rm -fR merged/*

This ends up failing with -EPERM after the files in dir have been
correctly deleted:

unlinkat(4, "2", 0) = 0
unlinkat(4, "1", 0) = 0
unlinkat(4, "3", 0) = 0
close(4) = 0
unlinkat(AT_FDCWD, "merged/dir", AT_REMOVEDIR) = -1 EPERM (Operation not
permitted)

Interestingly, if you don't place files in merged/dir you can remove it,
meaning if upper/dir does not exist, creating the char device file works
properly in that same location.

This patch uses ovl_sb_creator_cred() to get the cred struct from the
superblock mounter and overrides the old cred with these new ones so
that whiteout creation is possible. Overlay was wrong in assuming that
the creds it gets with prepare_creds will be in the initial user
namespace. The old cap_raise game is removed in favor of just overriding
the old cred struct.

This patch also drops from ovl_copy_up_one() the following two lines:

override_cred->fsuid = stat->uid;
override_cred->fsgid = stat->gid;

This is because the correct uid and gid are taken directly with the stat
struct and correctly set with ovl_set_attr().

Signed-off-by: Antonio Murdaca
Signed-off-by: Miklos Szeredi

Antonio Murdaca
2016-05-27 14:55:26 +0800
e12fab28d Merge branch 'akpm' (patches from Andrew)

Merge fixes from Andrew Morton:
"10 fixes"

* emailed patches from Andrew Morton:
drivers/pinctrl/intel/pinctrl-baytrail.c: fix build with gcc-4.4
update "mm/zsmalloc: don't fail if can't create debugfs info"
dma-debug: avoid spinlock recursion when disabling dma-debug
mm: oom_reaper: remove some bloat
memcg: fix mem_cgroup_out_of_memory() return value.
ocfs2: fix improper handling of return errno
mm: slub: remove unused virt_to_obj()
mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta
mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly
seqlock: fix raw_read_seqcount_latch()

Linus Torvalds
2016-05-27 12:32:40 +0800
478a1469a Merge tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull DAX locking updates from Ross Zwisler:
"Filesystem DAX locking for 4.7

- We use a bit in an exceptional radix tree entry as a lock bit and
use it similarly to how page lock is used for normal faults. This
fixes races between hole instantiation and read faults of the same
index.

- Filesystem DAX PMD faults are disabled, and will be re-enabled when
PMD locking is implemented"
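
The lock-bit idea in the first item can be illustrated with a minimal userspace sketch. It assumes C11 atomics and a slot whose lowest bit is free to use as the lock; the names ENTRY_LOCK, entry_trylock() and entry_unlock() are hypothetical and are not the kernel's DAX helpers. The kernel code sleeps on a wait queue when the bit is already set, which this sketch does not model.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 0 of the slot is the lock bit,
 * all other bits carry the entry's payload. */
#define ENTRY_LOCK ((uintptr_t)1)

/* Try to set the lock bit. Returns true if this caller took the lock,
 * false if the bit was already set by someone else. */
static bool entry_trylock(_Atomic uintptr_t *slot)
{
    uintptr_t old = atomic_load(slot);

    while (!(old & ENTRY_LOCK)) {
        /* On failure, 'old' is refreshed with the current slot value. */
        if (atomic_compare_exchange_weak(slot, &old, old | ENTRY_LOCK))
            return true;
    }
    return false;
}

/* Clear the lock bit and leave the payload bits untouched. */
static void entry_unlock(_Atomic uintptr_t *slot)
{
    atomic_fetch_and(slot, ~ENTRY_LOCK);
}
```

A fault path would call entry_trylock() on the slot for its index before touching the mapping and entry_unlock() when done, much as a page lock brackets a normal fault.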

* tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Remove i_mmap_lock protection
dax: Use radix tree entry lock to protect cow faults
dax: New fault locking
dax: Allow DAX code to replace exceptional entries
dax: Define DAX lock bit for radix tree exceptional entry
dax: Make huge page handling depend of CONFIG_BROKEN
dax: Fix condition for filling of PMD holes

Linus Torvalds
2016-05-27 11:00:28 +0800
315227f6d Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull misc DAX updates from Vishal Verma:
"DAX error handling for 4.7

- Until now, dax has been disabled if media errors were found on any
device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.

- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.

Other misc changes:

- When mounting DAX filesystems, check to make sure the partition is
page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent
reads/writes would fail.

- Misc/cleanup fixes from Jan that remove unused code from DAX
related to zeroing, writeback, and some size checks"
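
As a rough illustration of the mount-time alignment check described above, the sketch below tests whether a partition starts and ends on a page boundary. It is not the kernel's bdev_dax_supported(); the function name, the 512-byte sector size, the 4096-byte page size, and the decision to check the length as well as the start are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512u   /* assumed sector size in bytes */
#define SKETCH_PAGE_SIZE   4096u  /* assumed page size in bytes */

/* Returns true when both the partition's start offset and its length,
 * converted to bytes, are multiples of the page size. */
static bool partition_page_aligned(uint64_t start_sector, uint64_t nr_sectors)
{
    uint64_t start = start_sector * SKETCH_SECTOR_SIZE;
    uint64_t len = nr_sectors * SKETCH_SECTOR_SIZE;

    return (start % SKETCH_PAGE_SIZE) == 0 &&
           (len % SKETCH_PAGE_SIZE) == 0;
}
```

A partition starting at sector 2048 passes this check, while one starting at the legacy sector 63 does not, so a mount with DAX requested would be refused up front instead of failing later on reads and writes.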

* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: fix a comment in dax_zero_page_range and dax_truncate_page
dax: for truncate/hole-punch, do zeroing through the driver if possible
dax: export a low-level __dax_zero_page_range helper
dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax: enable dax in the presence of known media errors (badblocks)
dax: fallback from pmd to pte on error
block: Update blkdev_dax_capable() for consistency
xfs: Add alignment check for DAX mount
ext2: Add alignment check for DAX mount
ext4: Add alignment check for DAX mount
block: Add bdev_dax_supported() for dax mount checks
block: Add vfs_msg() interface
dax: Remove redundant inode size checks
dax: Remove pointless writeback from dax_do_io()
dax: Remove zeroing from dax_io()
dax: Remove dead zeroing code from fault handlers
ext2: Avoid DAX zeroing to corrupt data
ext2: Fix block zeroing in ext2_get_blocks() for DAX
dax: Remove complete_unwritten argument
DAX: move RADIX_DAX_ definitions to dax.c

Linus Torvalds
2016-05-27 10:34:26 +0800
1f3a437fa ocfs2: fix improper handling of return errno

Previously, if a bad inode was found in ocfs2_iget(), -ESTALE was
returned to the caller regardless of the actual error. Since commit
d2b9d71a2da7 ("ocfs2: check/fix inode block for online file check") can
now handle the return value from ocfs2_read_locked_inode(), we know the
exact errno that was returned.

Link: http://lkml.kernel.org/r/1463970656-18413-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Ren
2016-05-27 06:35:44 +0800
a10c38a4f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

Pull Ceph updates from Sage Weil:
"This changeset has a few main parts:

- Ilya has finished a huge refactoring effort to sync up the
client-side logic in libceph with the user-space client code, which
has evolved significantly over the last couple years, with lots of
additional behaviors (e.g., how requests are handled when cluster
is full and transitions from full to non-full).

The structure of the code is now more closely aligned with userspace,
so it will be much easier to maintain going forward when behavior
changes take place. There are some locking improvements bundled in
as well.

- Zheng adds multi-filesystem support (multiple namespaces within the
same Ceph cluster)

- Zheng has changed the readdir offsets and directory enumeration so
that dentry offsets are hash-based and therefore stable across
directory fragmentation events on the MDS.

- Zheng has a smorgasbord of bug fixes across fs/ceph"
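
To make the hash-based readdir offsets concrete, here is a small sketch of one way to pack a name hash and a per-hash index into a single 64-bit directory offset. The bit layout and the names (SKETCH_IDX_BITS, make_dir_offset() and the two accessors) are assumptions for illustration only and are not taken from fs/ceph.

```c
#include <stdint.h>

/* Hypothetical layout: low 28 bits hold an index that separates
 * entries sharing a hash, the bits above hold the name hash. */
#define SKETCH_IDX_BITS 28
#define SKETCH_IDX_MASK ((UINT64_C(1) << SKETCH_IDX_BITS) - 1)

/* Compose an offset from the hash of the dentry name and an index. */
static uint64_t make_dir_offset(uint32_t name_hash, uint32_t idx)
{
    return ((uint64_t)name_hash << SKETCH_IDX_BITS) |
           ((uint64_t)idx & SKETCH_IDX_MASK);
}

/* Recover the hash part of an offset. */
static uint32_t dir_offset_hash(uint64_t off)
{
    return (uint32_t)(off >> SKETCH_IDX_BITS);
}

/* Recover the index part of an offset. */
static uint32_t dir_offset_idx(uint64_t off)
{
    return (uint32_t)(off & SKETCH_IDX_MASK);
}
```

Because such an offset depends only on the entry's name hash and not on which directory fragment currently holds it, a reader can resume from a saved offset after the directory has been split or merged.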

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
ceph: fix wake_up_session_cb()
ceph: don't use truncate_pagecache() to invalidate read cache
ceph: SetPageError() for writeback pages if writepages fails
ceph: handle interrupted ceph_writepage()
ceph: make ceph_update_writeable_page() uninterruptible
libceph: make ceph_osdc_wait_request() uninterruptible
ceph: handle -EAGAIN returned by ceph_update_writeable_page()
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
ceph: block non-fatal signals for fault/page_mkwrite
ceph: make logical calculation functions return bool
ceph: tolerate bad i_size for symlink inode
ceph: improve fragtree change detection
ceph: keep leaf frag when updating fragtree
ceph: fix dir_auth check in ceph_fill_dirfrag()
ceph: don't assume frag tree splits in mds reply are sorted
ceph: fix inode reference leak
ceph: using hash value to compose dentry offset
ceph: don't forbid marking directory complete after forward seek
ceph: record 'offset' for each entry of readdir result
ceph: define 'end/complete' in readdir reply as bit flags
...

Linus Torvalds
2016-05-27 05:10:32 +0800
56244ef15 Btrfs: fix handling of faults from btrfs_copy_from_user

When btrfs_copy_from_user isn't able to copy all of the pages, we need
to adjust our accounting to reflect the work that was actually done.

Commit 2e78c927d79 changed around the decisions a little and we ended up
skipping the accounting adjustments some of the time. This commit makes
sure that when we don't copy anything at all, we still hop into
the adjustments, and switches to release_bytes instead of write_bytes,
since write_bytes isn't aligned.
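
The adjustment described above can be modelled in a few lines. This is a simplified sketch, not the btrfs code: the struct, the function name and the 4096-byte page size are assumptions, and it only shows how a short or empty copy translates into page-granular amounts to give back.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size in bytes */

struct write_accounting {
    size_t dirty_pages;    /* pages that actually received data */
    size_t release_bytes;  /* reserved bytes to hand back, page aligned */
};

/* num_pages: pages reserved for this write
 * offset_in_page: where the write starts inside the first page
 * copied: bytes the copy from user space actually delivered */
static struct write_accounting account_short_copy(size_t num_pages,
                                                  size_t offset_in_page,
                                                  size_t copied)
{
    struct write_accounting acct = { 0, 0 };

    /* A copy of zero bytes dirties nothing but still falls through
     * to the release calculation below. */
    if (copied > 0)
        acct.dirty_pages = (offset_in_page + copied +
                            SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

    if (num_pages > acct.dirty_pages)
        acct.release_bytes = (num_pages - acct.dirty_pages) *
                             SKETCH_PAGE_SIZE;

    return acct;
}
```

With four pages reserved and nothing copied, all 16384 bytes are released; with 5000 bytes copied at offset 100, two pages are dirty and 8192 bytes are released.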

The accounting errors led to warnings during btrfs_destroy_inode:

[ 70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0
[ 70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk
[ 70.847538] CPU: 10 PID: 514 Comm: umount Tainted: G W 4.6.0-rc6_00062_g2997da1-dirty #23
[ 70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[ 70.847542] 0000000000000000 ffff880ff5cafab8 ffffffff8149d5e9 0000000000000202
[ 70.847543] 0000000000000000 0000000000000000 0000000000000000 ffff880ff5cafb08
[ 70.847547] ffffffff8107bdfd ffff880ff5cafaf8 000024868120013d ffff880ff5cafb28
[ 70.847547] Call Trace:
[ 70.847550] [] dump_stack+0x51/0x78
[ 70.847551] [] __warn+0xfd/0x120
[ 70.847553] [] warn_slowpath_null+0x1d/0x20
[ 70.847555] [] btrfs_destroy_inode+0x2b3/0x2c0
[ 70.847556] [] ? __destroy_inode+0x71/0x140
[ 70.847558] [] destroy_inode+0x43/0x70
[ 70.847559] [] ? wake_up_bit+0x2f/0x40
[ 70.847560] [] evict+0x148/0x1d0
[ 70.847562] [] ? start_transaction+0x3de/0x460
[ 70.847564] [] dispose_list+0x59/0x80
[ 70.847565] [] evict_inodes+0x180/0x190
[ 70.847566] [] ? __sync_filesystem+0x3f/0x50
[ 70.847568] [] generic_shutdown_super+0x48/0x100
[ 70.847569] [] ? woken_wake_function+0x20/0x20
[ 70.847571] [] kill_anon_super+0x16/0x30
[ 70.847573] [] btrfs_kill_super+0x1e/0x130
[ 70.847574] [] deactivate_locked_super+0x4e/0x90
[ 70.847576] [] deactivate_super+0x51/0x70
[ 70.847577] [] cleanup_mnt+0x3f/0x80
[ 70.847579] [] __cleanup_mnt+0x12/0x20
[ 70.847581] [] task_work_run+0x68/0xa0
[ 70.847582] [] exit_to_usermode_loop+0xd6/0xe0
[ 70.847583] [] do_syscall_64+0xbd/0x170
[ 70.847586] [] entry_SYSCALL64_slow_path+0x25/0x25

This is the test program I used to force short returns from
btrfs_copy_from_user:

/*
 * The #include and #define lines do not appear in the commit message.
 * The headers below are the ones the code needs; the three values are
 * assumptions chosen only so that the program builds and runs.
 */
#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUFSIZE (1024 * 1024)           /* assumed value */
#define ROUNDS  1024                    /* assumed value */
#define MAXSIZE (256UL * 1024 * 1024)   /* assumed value */

/* Keep dropping the first quarter of the buffer so pwrite() faults on it. */
void *dontneed(void *arg)
{
    char *p = arg;
    int ret;

    while (1) {
        ret = madvise(p, BUFSIZE/4, MADV_DONTNEED);
        if (ret) {
            perror("madvise");
            exit(1);
        }
    }
}

int main(int ac, char **av) {
    int ret;
    int fd;
    char *filename;
    unsigned long offset;
    char *buf;
    int i;
    pthread_t tid;

    if (ac != 2) {
        fprintf(stderr, "usage: dammitdave filename\n");
        exit(1);
    }

    buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    memset(buf, 'a', BUFSIZE);
    filename = av[1];

    ret = pthread_create(&tid, NULL, dontneed, buf);
    if (ret) {
        fprintf(stderr, "error %d from pthread_create\n", ret);
        exit(1);
    }

    ret = pthread_detach(tid);
    if (ret) {
        fprintf(stderr, "pthread detach failed %d\n", ret);
        exit(1);
    }

    while (1) {
        fd = open(filename, O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            perror("open");
            exit(1);
        }

        for (i = 0; i < ROUNDS; i++) {
            int this_write = BUFSIZE;

            /* Write the whole buffer at a random offset. */
            offset = rand() % MAXSIZE;
            ret = pwrite(fd, buf, this_write, offset);
            if (ret < 0) {
                perror("pwrite");
                exit(1);
            } else if (ret != this_write) {
                fprintf(stderr, "short write to %s offset %lu ret %d\n",
                        filename, offset, ret);
                exit(1);
            }
            /* Start writeback once, on the last round. */
            if (i == ROUNDS - 1) {
                ret = sync_file_range(fd, offset, 4096,
                                      SYNC_FILE_RANGE_WRITE);
                if (ret < 0) {
                    perror("sync_file_range");
                    exit(1);
                }
            }
        }
        ret = ftruncate(fd, 0);
        if (ret < 0) {
            perror("ftruncate");
            exit(1);
        }
        ret = close(fd);
        if (ret) {
            perror("close");
            exit(1);
        }
        ret = unlink(filename);
        if (ret) {
            perror("unlink");
            exit(1);
        }

    }
    return 0;
}

Signed-off-by: Chris Mason
Reported-by: Dave Jones
Fixes: 2e78c927d79333f299a8ac81c2fd2952caeef335
cc: stable@vger.kernel.org # v4.6
Signed-off-by: Chris Mason

Chris Mason
2016-05-27 04:23:59 +0800
ea8ea737c Merge tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
"Highlights include:

Features:
- Add support for the NFS v4.2 COPY operation
- Add support for NFS/RDMA over IPv6

Bugfixes and cleanups:
- Avoid race that crashes nfs_init_commit()
- Fix oops in callback path
- Fix LOCK/OPEN race when unlinking an open file
- Choose correct stateids when using delegations in setattr, read and
write
- Don't send empty SETATTR after OPEN_CREATE
- xprtrdma: Prevent server from writing a reply into memory client
has released
- xprtrdma: Support using Read list and Reply chunk in one RPC call"

* tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits)
pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
nfs: avoid race that crashes nfs_init_commit
NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()
pnfs: make pnfs_layout_process more robust
pnfs: rework LAYOUTGET retry handling
pnfs: lift retry logic from send_layoutget to pnfs_update_layout
pnfs: fix bad error handling in send_layoutget
flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
pnfs: keep track of the return sequence number in pnfs_layout_hdr
pnfs: record sequence in pnfs_layout_segment when it's created
pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set
pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes
pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io
pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
NFS: Reclaim writes via writepage are opportunistic
NFSv4: Use the right stateid for delegations in setattr, read and write
...

Linus Torvalds
2016-05-27 01:33:33 +0800
0b9210c9c Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
"A pretty average collection of fixes, cleanups and improvements in
this request.

Summary:
- fixes for mount line parsing, sparse warnings, read-only compat
feature remount behaviour
- allow fast path symlink lookups for inline symlinks.
- attribute listing cleanups
- writeback goes direct to bios rather than indirecting through
bufferheads
- transaction allocation cleanup
- optimised kmem_realloc
- added configurable error handling for metadata write errors,
changed default error handling behaviour from "retry forever" to
"retry until unmount then fail"
- fixed several inode cluster writeback lookup vs reclaim race
conditions
- fixed inode cluster writeback checking wrong inode after lookup
- fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
- cleaned up inode reclaim tagging"
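
The change of default from "retry forever" to "retry until unmount then fail" can be pictured as a small policy check. The sketch below is an assumption-laden model, not the XFS implementation: the struct, its field names and should_retry() are hypothetical, and retry timeouts are left out.

```c
#include <stdbool.h>

#define RETRY_FOREVER (-1)

/* Hypothetical per-error policy for failed metadata writes. */
struct err_policy {
    int max_retries;       /* RETRY_FOREVER, or a finite retry limit */
    bool fail_at_unmount;  /* stop retrying once unmount has started */
};

/* Decide whether a failed metadata write should be retried. */
static bool should_retry(const struct err_policy *p, int retries_done,
                         bool unmounting)
{
    if (unmounting && p->fail_at_unmount)
        return false;
    if (p->max_retries == RETRY_FOREVER)
        return true;
    return retries_done < p->max_retries;
}
```

In this model the old default is { RETRY_FOREVER, false }, which can hang an unmount on a persistent error, and the new default is { RETRY_FOREVER, true }, which keeps retrying during normal operation but gives up at unmount.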

* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
xfs: fix warning in xfs_finish_page_writeback for non-debug builds
xfs: move reclaim tagging functions
xfs: simplify inode reclaim tagging interfaces
xfs: rename variables in xfs_iflush_cluster for clarity
xfs: xfs_iflush_cluster has range issues
xfs: mark reclaimed inodes invalid earlier
xfs: xfs_inode_free() isn't RCU safe
xfs: optimise xfs_iext_destroy
xfs: skip stale inodes in xfs_iflush_cluster
xfs: fix inode validity check in xfs_iflush_cluster
xfs: xfs_iflush_cluster fails to abort on error
xfs: remove xfs_fs_evict_inode()
xfs: add "fail at unmount" error handling configuration
xfs: add configuration handlers for specific errors
xfs: add configuration of error failure speed
xfs: introduce table-based init for error behaviors
xfs: add configurable error support to metadata buffers
xfs: introduce metadata IO error class
xfs: configurable error behavior via sysfs
xfs: buffer ->bi_end_io function requires irq-safe lock
...

Linus Torvalds
2016-05-27 01:13:40 +0800

26 May, 2016

3 commits

c7d73af2d pnfs: pnfs_update_layout needs to consider if strict iomode checking is on

As flexfiles has FF_FLAGS_NO_READ_IO, there is a need to generically
support enforcing that an IOMODE_RW segment will not allow READ I/O.
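
A minimal sketch of what strict iomode checking means when choosing a layout segment for a request. The enum and iomode_matches() are local to this example and are assumptions; they are not the kernel's pnfs range-matching code, and byte-range intersection is left out.

```c
#include <stdbool.h>

/* Local copy of the two iomodes relevant here. */
enum sketch_iomode {
    SKETCH_IOMODE_READ = 1,
    SKETCH_IOMODE_RW = 2,
};

/* Can a cached segment with seg_mode serve a request for req_mode? */
static bool iomode_matches(enum sketch_iomode seg_mode,
                           enum sketch_iomode req_mode,
                           bool strict_iomode)
{
    /* Strict: the segment must have exactly the requested iomode. */
    if (strict_iomode)
        return seg_mode == req_mode;

    /* Relaxed: an RW segment may also be used for reads. */
    return seg_mode == req_mode ||
           (seg_mode == SKETCH_IOMODE_RW && req_mode == SKETCH_IOMODE_READ);
}
```

With strict checking on, a READ request no longer matches an existing IOMODE_RW segment, so the client has to obtain a separate read layout instead of reading through the RW one.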

Signed-off-by: Tom Haynes
Signed-off-by: Anna Schumaker

Tom Haynes
2016-05-26 20:40:56 +0800
602c4cd45 nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled

Signed-off-by: Tom Haynes
Signed-off-by: Anna Schumaker

Tom Haynes
2016-05-26 20:40:51 +0800
002354112 restore killability of old mutex_lock_killable(&inode->i_mutex) users

The ones that are taking it exclusive, that is...

Signed-off-by: Al Viro

Al Viro
2016-05-26 12:13:25 +0800