29 May, 2016

10 commits

  • The self-test was updated to cover zero-length strings; the function
    needs to be updated, too.

    Reported-by: Geert Uytterhoeven
    Signed-off-by: George Spelvin
    Fixes: fcfd2fbf22d2 ("fs/namei.c: Add hashlen_string() function")
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • The original name was simply hash_string(), but that conflicted with a
    function with that name in drivers/base/power/trace.c, and I decided
    that calling it "hashlen_" was better anyway.

    But you have to do it in two places.

    [ This caused build errors for architectures that don't define
    CONFIG_DCACHE_WORD_ACCESS - Linus ]

    Signed-off-by: George Spelvin
    Reported-by: Guenter Roeck
    Fixes: fcfd2fbf22d2 ("fs/namei.c: Add hashlen_string() function")
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • The HPFS filesystem used generic_show_options to produce the string that is
    displayed in /proc/mounts. However, there is a problem that the options
    may disappear after remount. If we mount the filesystem with option1
    and then remount it with option2, /proc/mounts should show both option1
    and option2, however it only shows option2 because the whole option
    string is replaced with replace_mount_options in hpfs_remount_fs.

    To fix this bug, implement the hpfs_show_options function that prints
    options that are currently selected.
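
A minimal sketch of the idea, assuming a hypothetical option structure (`demo_mount_opts`) and helper (`demo_show_options`) that are not part of HPFS: the reported string is rebuilt from the options currently in force, so a remount that changes one option cannot hide the others.

```c
#include <stdio.h>

/* Hypothetical in-memory mount state; the real filesystem has more fields. */
struct demo_mount_opts {
	unsigned int uid;
	int lowercase;		/* non-zero once "case=lower" was selected */
};

/*
 * Build the option string from the state currently in force instead of
 * echoing back whatever string the last mount call passed in.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int demo_show_options(const struct demo_mount_opts *o,
			     char *buf, size_t size)
{
	int n = snprintf(buf, size, "uid=%u%s", o->uid,
			 o->lowercase ? ",case=lower" : "");

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Mounting with `uid=1000` and later remounting with only `case=lower` updates one field, and the rebuilt string still reports both options.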

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • Commit c8f33d0bec99 ("affs: kstrdup() memory handling") checks if the
    kstrdup function returns NULL due to out-of-memory condition.

    However, if we are remounting a filesystem with no change to
    filesystem-specific options, the parameter data is NULL. In this case,
    kstrdup returns NULL (because it was passed NULL parameter), although no
    out of memory condition exists. The mount syscall then fails with
    ENOMEM.

    This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.

    The patch also changes the call to replace_mount_options - if we didn't
    pass any filesystem-specific options, we don't call
    replace_mount_options (thus we don't erase existing reported options).
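
The check described above can be sketched in user space as follows. `dup_or_null()` stands in for kstrdup() and `demo_remount()` is a hypothetical helper, not the affs code; the point is only that a NULL copy counts as out-of-memory when the input was non-NULL, and that the saved option string is left alone when no options were passed.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in for kstrdup(): returns NULL for a NULL input as well as on
 * allocation failure, which is what makes the two cases easy to confuse.
 */
static char *dup_or_null(const char *s)
{
	size_t len;
	char *copy;

	if (!s)
		return NULL;
	len = strlen(s) + 1;
	copy = malloc(len);
	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/*
 * Hypothetical remount helper. *saved_opts plays the role of the option
 * string reported in /proc/mounts. Returns 0 or -ENOMEM.
 */
static int demo_remount(char **saved_opts, const char *data)
{
	char *new_opts = dup_or_null(data);

	/* NULL data means "no option change", not out of memory. */
	if (data && !new_opts)
		return -ENOMEM;

	/* Replace the reported options only if some were actually passed. */
	if (new_opts) {
		free(*saved_opts);
		*saved_opts = new_opts;
	}
	return 0;
}
```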

    Fixes: c8f33d0bec99 ("affs: kstrdup() memory handling")
    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org # v4.1+
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • Commit ce657611baf9 ("hpfs: kstrdup() out of memory handling") checks if
    the kstrdup function returns NULL due to out-of-memory condition.

    However, if we are remounting a filesystem with no change to
    filesystem-specific options, the parameter data is NULL. In this case,
    kstrdup returns NULL (because it was passed NULL parameter), although no
    out of memory condition exists. The mount syscall then fails with
    ENOMEM.

    This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.

    The patch also changes the call to replace_mount_options - if we didn't
    pass any filesystem-specific options, we don't call
    replace_mount_options (thus we don't erase existing reported options).

    Fixes: ce657611baf9 ("hpfs: kstrdup() out of memory handling")
    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • Various builds (such as i386:allmodconfig) fail with

    fs/binfmt_aout.c:133:2: error: expected identifier or '(' before 'return'
    fs/binfmt_aout.c:134:1: error: expected identifier or '(' before '}' token

    [ Oops. My bad, I had stupidly thought that "allmodconfig" covered this
    on x86-64 too, but it obviously doesn't. Egg on my face. - Linus ]

    Fixes: 5d22fc25d4fc ("mm: remove more IS_ERR_VALUE abuses")
    Signed-off-by: Guenter Roeck
    Signed-off-by: Linus Torvalds

    Guenter Roeck
     
  • Pull string hash improvements from George Spelvin:
    "This series does several related things:

    - Makes the dcache hash (fs/namei.c) useful for general kernel use.

    (Thanks to Bruce for noticing the zero-length corner case)

    - Converts the string hashes in to use the
    above.

    - Avoids 64-bit multiplies in hash_64() on 32-bit platforms. Two
    32-bit multiplies will do well enough.

    - Rids the world of the bad hash multipliers in hash_32.

    This finishes the job started in commit 689de1d6ca95 ("Minimal
    fix-up of bad hashing behavior of hash_64()")

    The vast majority of Linux architectures have hardware support for
    32x32-bit multiply and so derive no benefit from "simplified"
    multipliers.

    The few processors that do not (68000, h8/300 and some models of
    Microblaze) have arch-specific implementations added. Those
    patches are last in the series.

    - Overhauls the dcache hash mixing.

    The patch in commit 0fed3ac866ea ("namei: Improve hash mixing if
    CONFIG_DCACHE_WORD_ACCESS") was an off-the-cuff suggestion.
    Replaced with a much more careful design that's simultaneously
    faster and better. (My own invention, as there was nothing suitable
    in the literature I could find. Comments welcome!)

    - Modify the hash_name() loop to skip the initial HASH_MIX(). This
    would let us salt the hash if we ever wanted to.

    - Sort out partial_name_hash().

    The hash function is declared as using a long state, even though
    it's truncated to 32 bits at the end and the extra internal state
    contributes nothing to the result. And some callers do odd things:

    - fs/hfs/string.c only allocates 32 bits of state
    - fs/hfsplus/unicode.c uses it to hash 16-bit unicode symbols not bytes

    - Modify bytemask_from_count to handle inputs of 1..sizeof(long)
    rather than 0..sizeof(long)-1. This would simplify users other
    than full_name_hash"

    Special thanks to Bruce Fields for testing and finding bugs in v1. (I
    learned some humbling lessons about "obviously correct" code.)

    On the arch-specific front, the m68k assembly has been tested in a
    standalone test harness, I've been in contact with the Microblaze
    maintainers who mostly don't care, as the hardware multiplier is never
    omitted in real-world applications, and I haven't heard anything from
    the H8/300 world"
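
To illustrate the point about avoiding 64-bit multiplies on 32-bit platforms, here is a rough sketch of hashing a 64-bit value with two 32x32-bit multiplies. The function names and the odd multiplier are assumptions chosen for illustration, not the series' actual code.

```c
#include <stdint.h>

/* Arbitrary odd multiplier, assumed here purely for illustration. */
#define DEMO_MULT_32 0x61C88647u

/* Multiplicative hash of a 32-bit value, keeping the top `bits` bits.
 * `bits` must be in the range 1..32. */
static uint32_t demo_fold_hash_32(uint32_t val, unsigned int bits)
{
	return (uint32_t)(val * DEMO_MULT_32) >> (32 - bits);
}

/*
 * Hash a 64-bit value using only 32-bit multiplies: the first multiply
 * scrambles the high half, which is XORed into the low half, and the
 * second multiply hashes the combined word.
 */
static uint32_t demo_hash_64_on_32(uint64_t val, unsigned int bits)
{
	uint32_t hi = (uint32_t)(val >> 32);
	uint32_t lo = (uint32_t)val;

	return demo_fold_hash_32(lo ^ (hi * DEMO_MULT_32), bits);
}
```

The mixing is weaker than a full 64x64-bit multiply would give, which matches the "will do well enough" trade-off described in the pull message.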

    * 'hash' of git://ftp.sciencehorizons.net/linux:
    h8300: Add
    microblaze: Add
    m68k: Add
    : Add support for architecture-specific functions
    fs/namei.c: Improve dcache hash function
    Eliminate bad hash multipliers from hash_32() and hash_64()
    Change hash_64() return value to 32 bits
    : Define hash_str() in terms of hashlen_string()
    fs/namei.c: Add hashlen_string() function
    Pull out string hash to

    Linus Torvalds
     
  • This is just the infrastructure; there are no users yet.

    This is modelled on CONFIG_ARCH_RANDOM; a CONFIG_ symbol declares
    the existence of .

    That file may define its own versions of various functions, and define
    HAVE_* symbols (no CONFIG_ prefix!) to suppress the generic ones.

    Included is a self-test (in lib/test_hash.c) that verifies the basics.
    It is NOT in general required that the arch-specific functions compute
    the same thing as the generic, but if a HAVE_* symbol is defined with
    the value 1, then equality is tested.
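
The override mechanism described above might look roughly like this when collapsed into one file. All names (`HAVE_ARCH_DEMO_MUL_HASH`, `demo_mul_hash`, `demo_selftest`) are hypothetical, and the multiplier is an arbitrary odd constant; the sketch only shows how a HAVE_* symbol suppresses the generic definition and how the value 1 opts in to an equality test.

```c
#include <stdint.h>

/* ---- what a hypothetical arch-specific header might provide ---- */
#define HAVE_ARCH_DEMO_MUL_HASH 1	/* 1: claims the same result as generic */
static inline uint32_t demo_mul_hash(uint32_t val)
{
	/* Same product as the generic version, split into two smaller
	 * multiplies: 0x61C88647 == (0x61C8 << 16) + 0x8647. */
	return ((val * 0x61C8u) << 16) + val * 0x8647u;
}

/* ---- generic header ---- */
static inline uint32_t demo_mul_hash_generic(uint32_t val)
{
	return val * 0x61C88647u;
}

#ifndef HAVE_ARCH_DEMO_MUL_HASH
#define demo_mul_hash demo_mul_hash_generic	/* no arch override present */
#endif

/* ---- self-test: compare only when the arch claims equality ---- */
static int demo_selftest(void)
{
#if defined(HAVE_ARCH_DEMO_MUL_HASH) && HAVE_ARCH_DEMO_MUL_HASH == 1
	uint32_t v;

	for (v = 0; v < 1000; v++) {
		uint32_t in = v * 2654435761u;	/* spread the test inputs */

		if (demo_mul_hash(in) != demo_mul_hash_generic(in))
			return -1;
	}
#endif
	return 0;
}
```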

    Signed-off-by: George Spelvin
    Cc: Geert Uytterhoeven
    Cc: Greg Ungerer
    Cc: Andreas Schwab
    Cc: Philippe De Muyter
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: Alistair Francis
    Cc: Michal Simek
    Cc: Yoshinori Sato
    Cc: uclinux-h8-devel@lists.sourceforge.jp

    George Spelvin
     
  • Patch 0fed3ac866 improved the hash mixing, but the function is slower
    than necessary; there's a 7-instruction dependency chain (10 on x86)
    each loop iteration.

    Word-at-a-time access is a very tight loop (which is good, because
    link_path_walk() is one of the hottest code paths in the entire kernel),
    and the hash mixing function must not have a longer latency to avoid
    slowing it down.

    There do not appear to be any published fast hash functions that:
    1) Operate on the input a word at a time, and
    2) Don't need to know the length of the input beforehand, and
    3) Have a single iterated mixing function, not needing conditional
    branches or unrolling to distinguish different loop iterations.

    One of the algorithms which comes closest is Yann Collet's xxHash, but
    that's two dependent multiplies per word, which is too much.

    The key insights in this design are:

    1) Barring expensive ops like multiplies, to diffuse one input bit
    across 64 bits of hash state takes at least log2(64) = 6 sequentially
    dependent instructions. That is more cycles than we'd like.
    2) An operation like "hash ^= hash << 13" requires a second temporary
    register anyway, and on a 2-operand machine like x86, it's three
    instructions.
    3) A better use of a second register is to hold a two-word hash state.
    With careful design, no temporaries are needed at all, so it doesn't
    increase register pressure. And this gets rid of register copying
    on 2-operand machines, so the code is smaller and faster.
    4) Using two words of state weakens the requirement for one-round mixing;
    we now have two rounds of mixing before cancellation is possible.
    5) A two-word hash state also allows operations on both halves to be
    done in parallel, so on a superscalar processor we get more mixing
    in fewer cycles.

    I ended up using a mixing function inspired by the ChaCha and Speck
    round functions. It is 6 simple instructions and 3 cycles per iteration
    (assuming multiply by 9 can be done by an "lea" instruction):

    x ^= *input++;
    y ^= x; x = ROL(x, K1);
    x += y; y = ROL(y, K2);
    y *= 9;

    Not only is this reversible, two consecutive rounds are reversible:
    if you are given the initial and final states, but not the intermediate
    state, it is possible to compute both input words. This means that at
    least 3 words of input are required to create a collision.

    (It also has the property, used by hash_name() to avoid a branch, that
    it hashes all-zero to all-zero.)
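
As a sketch of the round function above on 64-bit words, the following adds an explicit inverse to make the single-round reversibility concrete. The rotate amounts `DEMO_K1` and `DEMO_K2` are placeholder values assumed for illustration (the text here names only K1 and K2), and `demo_unmix()` is a hypothetical helper. It does not demonstrate the stronger two-round property.

```c
#include <stdint.h>

/* Placeholder rotate constants, assumed for this sketch. */
#define DEMO_K1 12
#define DEMO_K2 45
/* Multiplicative inverse of 9 modulo 2^64 (9 * DEMO_INV9 == 1). */
#define DEMO_INV9 0x8E38E38E38E38E39ull

static inline uint64_t rol64(uint64_t v, unsigned int r)
{
	return (v << r) | (v >> (64 - r));
}

static inline uint64_t ror64(uint64_t v, unsigned int r)
{
	return (v >> r) | (v << (64 - r));
}

struct demo_state {
	uint64_t x, y;
};

/* One round of the two-word mixing function quoted above. */
static void demo_mix(struct demo_state *s, uint64_t input)
{
	s->x ^= input;
	s->y ^= s->x;	s->x = rol64(s->x, DEMO_K1);
	s->x += s->y;	s->y = rol64(s->y, DEMO_K2);
	s->y *= 9;
}

/* Undo one round, given the same input word: each step is inverted
 * in reverse order. */
static void demo_unmix(struct demo_state *s, uint64_t input)
{
	s->y *= DEMO_INV9;
	s->y = ror64(s->y, DEMO_K2);
	s->x -= s->y;
	s->x = ror64(s->x, DEMO_K1);
	s->y ^= s->x;
	s->x ^= input;
}
```

Starting from an all-zero state and mixing in a zero word leaves the state all-zero, which is the property noted above.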

    The rotate constants K1 and K2 were found by experiment. The search took
    a sample of random initial states (I used 1023) and considered the effect
    of flipping each of the 64 input bits on each of the 128 output bits two
    rounds later. Each of the 8192 pairs can be considered a biased coin, and
    adding up the Shannon entropy of all of them produces a score.

    The best-scoring shifts also did well in other tests (flipping bits in y,
    trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
    so the choice was made with the additional constraint that the sum of the
    shifts is odd and not too close to the word size.

    The final state is then folded into a 32-bit hash value by a less carefully
    optimized multiply-based scheme. This also has to be fast, as pathname
    components tend to be short (the most common case is one iteration!), but
    there's some room for latency, as there is a fair bit of intervening logic
    before the hash value is used for anything.

    (Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
    a better benchmark; the numbers seem to show a slight dip in performance
    between 4.6.0 and this patch, but they're too noisy to quote.)

    Special thanks to Bruce Fields for diligent testing which uncovered a
    nasty fencepost error in an earlier version of this patch.

    [checkpatch.pl formatting complaints noted and respectfully disagreed with.]

    Signed-off-by: George Spelvin
    Tested-by: J. Bruce Fields

    George Spelvin
     
  • We'd like to make more use of the highly-optimized dcache hash functions
    throughout the kernel, rather than have every subsystem create its own,
    and a function that hashes basic null-terminated strings is required
    for that.

    (The name is to emphasize that it returns both hash and length.)

    It's actually useful in the dcache itself, specifically d_alloc_name().
    Other uses in the next patch.

    full_name_hash() is also tweaked to make it more generally useful:
    1) Take a "char *" rather than "unsigned char *" argument, to
    be consistent with hash_name().
    2) Handle zero-length inputs. If we want more callers, we don't want
    to make them worry about corner cases.
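
One way to picture a function that "returns both hash and length" is to pack the two into a single 64-bit value. The sketch below assumes that packing and uses a placeholder byte-at-a-time hash, not the dcache hash; all `demo_` names are hypothetical.

```c
#include <stdint.h>

/* Pack a 32-bit hash and a 32-bit length into one 64-bit value. */
static inline uint64_t demo_hashlen_create(uint32_t hash, uint32_t len)
{
	return ((uint64_t)len << 32) | hash;
}

static inline uint32_t demo_hashlen_hash(uint64_t hashlen)
{
	return (uint32_t)hashlen;
}

static inline uint32_t demo_hashlen_len(uint64_t hashlen)
{
	return (uint32_t)(hashlen >> 32);
}

/*
 * Hash a NUL-terminated string and report its length in the same pass.
 * The mixing step is a placeholder. A zero-length string is handled
 * without any special casing by the caller.
 */
static uint64_t demo_hashlen_string(const char *name)
{
	uint32_t hash = 0;
	uint32_t len = 0;

	while (name[len]) {
		hash = (hash + (unsigned char)name[len]) * 31u;
		len++;
	}
	return demo_hashlen_create(hash, len);
}
```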

    Signed-off-by: George Spelvin

    George Spelvin
     

28 May, 2016

17 commits

  • Pull UBI/UBIFS updates from Richard Weinberger:
    "This contains mostly cleanups and minor improvements of UBI and UBIFS"

    * tag 'upstream-4.7-rc1' of git://git.infradead.org/linux-ubifs:
    ubifs: ubifs_dump_inode: Fix dumping field bulk_read
    UBI: Fix static volume checks when Fastmap is used
    UBI: Set free_count to zero before walking through erase list
    UBI: Silence an unintialized variable warning
    UBI: Clean up return in ubi_remove_volume()
    UBI: Modify wrong comment in ubi_leb_map function.
    UBI: Don't read back all data in ubi_eba_copy_leb()
    UBI: Add ro-mode sysfs attribute

    Linus Torvalds
     
  • Older versions of gcc don't understand named initializers inside an
    anonymous structure or union member. It can be worked around by adding
    braces in the initializer for the anonymous member.
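
A small sketch of the workaround, using a hypothetical `struct demo_stateid` rather than the NFS type: the anonymous union gets its own pair of braces, and the named initializer for its member goes inside them.

```c
/* Hypothetical structure with an anonymous union as its first member. */
struct demo_stateid {
	union {
		char data[16];
		unsigned int words[4];
	};
	int type;
};

/*
 * Form that older gcc rejects with "unknown field 'data'":
 *
 *	static const struct demo_stateid id = { .data = { 0 }, .type = 1 };
 *
 * Workaround: wrap the anonymous member's initializer in braces so the
 * designator is resolved inside the union.
 */
static const struct demo_stateid demo_zero_id = {
	{ .data = { 0 } },
	.type = 1,
};
```

Newer compilers accept both forms, so the braced version is the portable one.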

    Without this, gcc 4.4.4 will fail the build with

    CC fs/nfs/nfs4state.o
    fs/nfs/nfs4state.c:69: error: unknown field ‘data’ specified in initializer
    fs/nfs/nfs4state.c:69: warning: missing braces around initializer
    fs/nfs/nfs4state.c:69: warning: (near initialization for ‘zero_stateid..data’)
    make[2]: *** [fs/nfs/nfs4state.o] Error 1

    introduced in commit 93b717fd81bf ("NFSv4: Label stateids with the type")

    Reported-and-tested-by: Boris Ostrovsky
    Cc: Anna Schumaker
    Cc: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull vfs fixes from Al Viro:
    "Followups to the parallel lookup work:

    - update docs

    - restore killability of the places that used to take ->i_mutex
    killably now that we have down_write_killable() merged

    - Additionally, it turns out that I missed a prerequisite for
    security_d_instantiate() stuff - ->getxattr() wasn't the only thing
    that could be called before dentry is attached to inode; with smack
    we needed the same treatment applied to ->setxattr() as well"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->setxattr() to passing dentry and inode separately
    switch xattr_handler->set() to passing dentry and inode separately
    restore killability of old mutex_lock_killable(&inode->i_mutex) users
    add down_write_killable_nested()
    update D/f/directory-locking

    Linus Torvalds
     
  • smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
    we'd hashed the new dentry and attached it to inode, we need ->setxattr()
    instances getting the inode as an explicit argument rather than obtaining
    it from dentry.

    Similar change for ->getxattr() had been done in commit ce23e64. Unlike
    ->getxattr() (which is used by both selinux and smack instances of
    ->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
    it got missed back then.

    Reported-by: Seung-Woo Kim
    Tested-by: Casey Schaufler
    Signed-off-by: Al Viro

    Al Viro
     
  • Pull overlayfs update from Miklos Szeredi:
    "The meat of this is a change to use the mounter's credentials for
    operations that require elevated privileges (such as whiteout
    creation). This fixes behavior under user namespaces as well as being
    a nice cleanup"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: Do d_type check only if work dir creation was successful
    ovl: update documentation
    ovl: override creds with the ones from the superblock mounter

    Linus Torvalds
     
  • Pull btrfs cleanups and fixes from Chris Mason:
    "We have another round of fixes and a few cleanups.

    I have a fix for short returns from btrfs_copy_from_user, which
    finally nails down a very hard to find regression we added in v4.6.

    Dave is pushing around gfp parameters, mostly to cleanup internal apis
    and make it a little more consistent.

    The rest are smaller fixes, and one speelling fixup patch"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
    Btrfs: fix handling of faults from btrfs_copy_from_user
    btrfs: fix string and comment grammatical issues and typos
    btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
    Btrfs: fix unexpected return value of fiemap
    Btrfs: free sys_array eb as soon as possible
    btrfs: sink gfp parameter to convert_extent_bit
    btrfs: make state preallocation more speculative in __set_extent_bit
    btrfs: untangle gotos a bit in convert_extent_bit
    btrfs: untangle gotos a bit in __clear_extent_bit
    btrfs: untangle gotos a bit in __set_extent_bit
    btrfs: sink gfp parameter to set_record_extent_bits
    btrfs: sink gfp parameter to set_extent_new
    btrfs: sink gfp parameter to set_extent_defrag
    btrfs: sink gfp parameter to set_extent_delalloc
    btrfs: sink gfp parameter to clear_extent_dirty
    btrfs: sink gfp parameter to clear_record_extent_bits
    btrfs: sink gfp parameter to clear_extent_bits
    btrfs: sink gfp parameter to set_extent_bits
    btrfs: make find_workspace warn if there are no workspaces
    btrfs: make find_workspace always succeed
    ...

    Linus Torvalds
     
  • The do_brk() and vm_brk() return value was "unsigned long" and returned
    the starting address on success, and an error value on failure. The
    reasons are entirely historical, and go back to it basically behaving
    like the mmap() interface does.

    However, nobody actually wanted that interface, and it causes totally
    pointless IS_ERR_VALUE() confusion.

    What every single caller actually wants is just the simpler integer
    return of zero for success and negative error number on failure.

    So just convert to that much clearer and more common calling convention,
    and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
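
The difference between the two calling conventions can be sketched in user space like this. The `demo_` functions are hypothetical stand-ins, not the real do_brk()/vm_brk(), and the failure condition (address range wrapping around) is chosen only to have something to report.

```c
#include <errno.h>

/* Stand-in for the error-range test on an unsigned long return value. */
static inline int demo_is_err_value(unsigned long v)
{
	return v >= (unsigned long)-4095;
}

/* Old convention: start address on success, encoded errno on failure. */
static unsigned long demo_brk_addr(unsigned long addr, unsigned long len)
{
	if (addr + len < addr)		/* range wraps around */
		return (unsigned long)-ENOMEM;
	return addr;
}

/* New convention: 0 on success, negative errno on failure. */
static int demo_brk(unsigned long addr, unsigned long len)
{
	if (addr + len < addr)
		return -ENOMEM;
	return 0;
}

/* A caller under the old convention has to decode the error itself. */
static int demo_old_caller(unsigned long addr, unsigned long len)
{
	unsigned long ret = demo_brk_addr(addr, len);

	if (demo_is_err_value(ret))
		return -(int)(0UL - ret);
	return 0;
}

/* Under the new convention the return value is already what callers want. */
static int demo_new_caller(unsigned long addr, unsigned long len)
{
	return demo_brk(addr, len);
}
```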

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Most users of IS_ERR_VALUE() in the kernel are wrong, as they
    pass an 'int' into a function that takes an 'unsigned long'
    argument. This happens to work because the type is sign-extended
    on 64-bit architectures before it gets converted into an
    unsigned type.

    However, anything that passes an 'unsigned short' or 'unsigned int'
    argument into IS_ERR_VALUE() is guaranteed to be broken, as are
    8-bit integers and types that are wider than 'unsigned long'.
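
The type problem can be reproduced with a simplified macro of the same shape (a comparison against an 'unsigned long' bound). `DEMO_IS_ERR_VALUE` and the wrapper functions are assumptions for illustration, not the kernel definition.

```c
#include <errno.h>

#define DEMO_MAX_ERRNO 4095
/* Simplified stand-in: the argument is compared as an unsigned long. */
#define DEMO_IS_ERR_VALUE(x) ((x) >= (unsigned long)-DEMO_MAX_ERRNO)

/* 'int' argument: sign-extended before the comparison, so -EINVAL lands
 * in the error range. This is the "happens to work" case. */
static int demo_check_int(int err)
{
	return DEMO_IS_ERR_VALUE(err);
}

/* 'unsigned int' argument: zero-extended where 'unsigned long' is wider,
 * so the value falls below the error range and the check misses it. */
static int demo_check_uint(unsigned int err)
{
	return DEMO_IS_ERR_VALUE(err);
}

/* 'unsigned long' argument: the only type the comparison is meant for. */
static int demo_check_ulong(unsigned long err)
{
	return DEMO_IS_ERR_VALUE(err);
}
```

Where 'unsigned long' is 64 bits and 'unsigned int' is 32 bits, `demo_check_uint((unsigned int)-EINVAL)` returns 0 even though the value was meant as an error.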

    Andrzej Hajda has already fixed a lot of the worst abusers that
    were causing actual bugs, but it would be nice to prevent any
    users that are not passing 'unsigned long' arguments.

    This patch changes all users of IS_ERR_VALUE() that I could find
    on 32-bit ARM randconfig builds and x86 allmodconfig. For the
    moment, this doesn't change the definition of IS_ERR_VALUE()
    because there are probably still architecture specific users
    elsewhere.

    Almost all the warnings I got are for files that are better off
    using 'if (err)' or 'if (err < 0)'.
    The only legitimate user I could find that we get a warning for
    is the (32-bit only) freescale fman driver, so I did not remove
    the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
    For 9pfs, I just worked around one user whose calling conventions
    are so obscure that I did not dare change the behavior.

    I was using this definition for testing:

    #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
    unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))

    which ends up making all 16-bit or wider types work correctly with
    the most plausible interpretation of what IS_ERR_VALUE() was supposed
    to return according to its users, but also causes a compile-time
    warning for any users that do not pass an 'unsigned long' argument.

    I suggested this approach earlier this year, but back then we ended
    up deciding to just fix the users that are obviously broken. After
    the initial warning that caused me to get involved in the discussion
    (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
    asked me to send the whole thing again.

    [ Updated the 9p parts as per Al Viro - Linus ]

    Signed-off-by: Arnd Bergmann
    Cc: Andrzej Hajda
    Cc: Andrew Morton
    Link: https://lkml.org/lkml/2016/1/7/363
    Link: https://lkml.org/lkml/2016/5/27/486
    Acked-by: Srinivas Kandagatla # For nvmem part
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
    not allowed to allocate blocks (get_more_blocks() sets 'create' to 0
    before calling the get_block() callback). If it's a sparse file, direct
    writes fall back to buffered writes to avoid stale data exposure from
    a concurrent buffered read. But there are two cases that can result in
    stale data exposure and are not correctly detected.

    1. The detection for "writing inside i_size" is not sufficient:
    writes can wrongly be treated as "extending writes". For example,
    direct write 1FSB (file system block) to a 1FSB sparse file on
    ext2/3/4, starting from offset 0. In this case it's writing inside
    i_size, but 'create' is non-zero, because 'block_in_file' and
    '(i_size_read(inode) >> blkbits)' are both zero.

    2. Direct writes starting from or beyond i_size (not inside i_size)
    also could trigger block allocation and expose stale data. For
    example, consider a sparse file with i_size of 2k, and a write to
    offset 2k or 3k into the file, with a filesystem block size of 4k.
    (Thanks to Jeff Moyer for pointing this case out in his review.)

    The first problem can be demonstrated by running ltp-aiodio test ADSP045
    many times. When testing on extN filesystems, I see occasional test
    failures where a buffered read returns non-zero (stale) data.

    ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1

    dio_sparse 0 TINFO : Dirtying free blocks
    dio_sparse 0 TINFO : Starting I/O tests
    non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
    non-zero read at offset 0
    dio_sparse 0 TINFO : Killing childrens(s)
    dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally

    The second problem can also be reproduced easily by a hacked dio_sparse
    program, which accepts an option to specify the write offset.

    What we should really do is to disable block allocation for writes that
    could result in filling holes inside i_size.
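
The block arithmetic behind both cases can be checked with a short sketch. `demo_may_allocate_old()` mirrors the comparison implied by case 1 above; `demo_may_allocate_new()` is one possible stricter check, assumed here for illustration and not necessarily the exact condition used by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Old-style check: allocation is refused only when the current block
 * index is below (i_size >> blkbits). With i_size smaller than one
 * block that bound is zero, so block 0 is wrongly treated as extending.
 */
static bool demo_may_allocate_old(uint64_t block_in_file, int64_t i_size,
				  unsigned int blkbits)
{
	return !(block_in_file < (uint64_t)(i_size >> blkbits));
}

/*
 * Stricter check (an assumption for this sketch): refuse allocation
 * whenever the write starts at or before the block that holds the last
 * byte of the file, since such a write could fill a hole inside i_size.
 */
static bool demo_may_allocate_new(uint64_t start_block, int64_t i_size,
				  unsigned int blkbits)
{
	if (i_size > 0 && start_block <= (uint64_t)((i_size - 1) >> blkbits))
		return false;
	return true;
}
```

With a 2k file and 4k blocks (blkbits = 12), a write starting in block 0 is allowed to allocate by the old check but refused by the stricter one, while a write starting in block 1 may still allocate.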

    Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
    Reviewed-by: Jan Kara
    Signed-off-by: Eryu Guan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eryu Guan
     
  • Two new messages are added to support negotiating hb timeout. Stop
    nodes that speak an old protocol version from mounting, as they would
    cause the negotiation to fail.

    Link: http://lkml.kernel.org/r/1464231615-27939-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • hr_last_timeout_start should be set to the last time at which hb was
    still OK. When an hb write times out, the hung time will be (jiffies -
    hr_last_timeout_start).

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Sometimes an I/O error is returned when storage is down for a while. For
    example, with an iSCSI device, storage is taken offline when the session
    times out, which makes all I/O return -EIO. In this case, nodes shouldn't
    negotiate the timeout but should fence themselves. So let nodes fence
    themselves when o2hb_do_disk_heartbeat returns an error; this is the same
    behavior as o2hb without the negotiate timer.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • This message is used to re-queue the write timeout timer and the negotiate
    timer when all nodes suffer a write hang to storage; this keeps nodes from
    fencing themselves if storage is down.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • This message is sent to the master node when a non-master node's negotiate
    timer expires. The master node records these nodes in a bitmap which is
    used to decide whether to re-queue the write timeout timer.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • This series of patches fixes the issue that when storage goes down, all
    nodes fence themselves due to write timeout.

    With this patch set, all nodes keep going until storage is back
    online, unless one of the following happens, in which case the nodes
    fence themselves as before:

    1. an I/O error is returned
    2. the network between nodes goes down
    3. nodes panic

    This patch (of 6):

    When storage goes down, all nodes fence themselves due to write timeout.
    The negotiate timer is designed to avoid this; with it, a node waits
    until storage is up again.

    The negotiate timer works in the following way:

    1. The timer expires before the write timeout timer; its timeout is
    currently half of the write timeout. It is re-queued along with the
    write timeout timer. If it expires, it sends a NEGO_TIMEOUT message to
    the master node (the node with the lowest node number). This message
    does nothing but mark a bit in a bitmap on the master node that records
    which nodes are negotiating the timeout.

    2. If storage is down, nodes send this message to the master node. When
    the master node finds that its bitmap includes all online nodes, it
    sends a NEGO_APPROVL message to all nodes one by one; this message
    re-queues the write timeout timer and the negotiate timer. Any node that
    doesn't receive this message, or hits a problem while handling it, will
    be fenced. If storage comes up at any time, o2hb_thread will run and
    re-queue all the timers, and nothing is affected by these two steps.
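
The master-side bookkeeping in the two steps above can be sketched with a plain bitmask. The structure and function names are hypothetical and the sketch ignores messaging, locking and timers; it only shows the "bitmap includes all online nodes" decision.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical master-side state: one bit per node (node numbers 0..63). */
struct demo_master {
	uint64_t nego_bitmap;
};

/* A node's negotiate timer expired and its timeout message arrived. */
static void demo_nego_timeout(struct demo_master *m, unsigned int node)
{
	m->nego_bitmap |= UINT64_C(1) << node;
}

/* Approve only once every online node has reported a timeout. */
static bool demo_should_approve(const struct demo_master *m, uint64_t online)
{
	return online != 0 && (m->nego_bitmap & online) == online;
}

/* After approval is sent, start a fresh negotiation round. */
static void demo_approve(struct demo_master *m)
{
	m->nego_bitmap = 0;
}
```

If only some nodes report, the bitmap never covers the online set, no approval goes out, and the write timeout timers run their course as before.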

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • preparation for similar switch in ->setxattr() (see the next commit for
    rationale).

    Signed-off-by: Al Viro

    Al Viro
     

27 May, 2016

10 commits

  • The d_type check requires successful creation of workdir, as it iterates
    through work dir and expects work dir to be present in it. If that's
    not the case, this check will always return d_type not supported even
    if underlying filesystem might be supporting it.

    So don't do this check if work dir creation failed in previous step.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • In user namespace the whiteout creation fails with -EPERM because the
    current process isn't capable(CAP_SYS_ADMIN) when setting xattr.

    A simple reproducer:

    $ mkdir upper lower work merged lower/dir
    $ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
    $ unshare -m -p -f -U -r bash

    Now as root in the user namespace:

    # touch merged/dir/{1,2,3} # this will force a copy up of lower/dir
    # rm -fR merged/*

    This ends up failing with -EPERM after the files in dir have been
    correctly deleted:

    unlinkat(4, "2", 0) = 0
    unlinkat(4, "1", 0) = 0
    unlinkat(4, "3", 0) = 0
    close(4) = 0
    unlinkat(AT_FDCWD, "merged/dir", AT_REMOVEDIR) = -1 EPERM (Operation not
    permitted)

    Interestingly, if you don't place files in merged/dir you can remove it,
    meaning if upper/dir does not exist, creating the char device file works
    properly in that same location.

    This patch uses ovl_sb_creator_cred() to get the cred struct from the
    superblock mounter and override the old cred with these new ones so that
    the whiteout creation is possible because overlay is wrong in assuming that
    the creds it will get with prepare_creds will be in the initial user
    namespace. The old cap_raise game is removed in favor of just overriding
    the old cred struct.

    This patch also drops from ovl_copy_up_one() the following two lines:

    override_cred->fsuid = stat->uid;
    override_cred->fsgid = stat->gid;

    This is because the correct uid and gid are taken directly with the stat
    struct and correctly set with ovl_set_attr().

    Signed-off-by: Antonio Murdaca
    Signed-off-by: Miklos Szeredi

    Antonio Murdaca
     
  • Merge fixes from Andrew Morton:
    "10 fixes"

    * emailed patches from Andrew Morton:
    drivers/pinctrl/intel/pinctrl-baytrail.c: fix build with gcc-4.4
    update "mm/zsmalloc: don't fail if can't create debugfs info"
    dma-debug: avoid spinlock recursion when disabling dma-debug
    mm: oom_reaper: remove some bloat
    memcg: fix mem_cgroup_out_of_memory() return value.
    ocfs2: fix improper handling of return errno
    mm: slub: remove unused virt_to_obj()
    mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta
    mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly
    seqlock: fix raw_read_seqcount_latch()

    Linus Torvalds
     
  • Pull DAX locking updates from Ross Zwisler:
    "Filesystem DAX locking for 4.7

    - We use a bit in an exceptional radix tree entry as a lock bit and
    use it similarly to how page lock is used for normal faults. This
    fixes races between hole instantiation and read faults of the same
    index.

    - Filesystem DAX PMD faults are disabled, and will be re-enabled when
    PMD locking is implemented"

    * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Remove i_mmap_lock protection
    dax: Use radix tree entry lock to protect cow faults
    dax: New fault locking
    dax: Allow DAX code to replace exceptional entries
    dax: Define DAX lock bit for radix tree exceptional entry
    dax: Make huge page handling depend of CONFIG_BROKEN
    dax: Fix condition for filling of PMD holes

    Linus Torvalds
     
  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
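
    The mount-time alignment requirement mentioned above can be sketched as a simple check. This is an illustration only: partition_page_aligned() is a hypothetical helper, the 512-byte sector and 4096-byte page sizes are assumptions, and checking the length as well as the start is a choice made for this sketch rather than something the pull message states.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_BYTES 512u  /* assumed sector size */
#define PAGE_BYTES   4096u /* assumed page size */

/* Return true if the partition starts and ends on a page boundary. */
static bool partition_page_aligned(uint64_t start_sector, uint64_t nr_sectors)
{
    uint64_t start = start_sector * SECTOR_BYTES;
    uint64_t len = nr_sectors * SECTOR_BYTES;

    return (start % PAGE_BYTES) == 0 && (len % PAGE_BYTES) == 0;
}
```

    Rejecting a misaligned partition at mount time reports the problem once, instead of letting the mount succeed and having later reads and writes fail.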
     
  • Previously, if a bad inode was found in ocfs2_iget(), -ESTALE was
    always returned to the caller. Since commit d2b9d71a2da7 ("ocfs2:
    check/fix inode block for online file check"), the return value of
    ocfs2_read_locked_inode() is handled, so we know the exact errno
    it returned.

    Link: http://lkml.kernel.org/r/1463970656-18413-1-git-send-email-zren@suse.com
    Signed-off-by: Eric Ren
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Ren
     
  • Pull Ceph updates from Sage Weil:
    "This changeset has a few main parts:

    - Ilya has finished a huge refactoring effort to sync up the
    client-side logic in libceph with the user-space client code, which
    has evolved significantly over the last couple years, with lots of
    additional behaviors (e.g., how requests are handled when cluster
    is full and transitions from full to non-full).

    The structure of the code is now more closely aligned with
    userspace, so it will be much easier to maintain when behavior
    changes take place. There are some locking improvements bundled in
    as well.

    - Zheng adds multi-filesystem support (multiple namespaces within the
    same Ceph cluster)

    - Zheng has changed the readdir offsets and directory enumeration so
    that dentry offsets are hash-based and therefore stable across
    directory fragmentation events on the MDS.

    - Zheng has a smorgasbord of bug fixes across fs/ceph"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
    ceph: fix wake_up_session_cb()
    ceph: don't use truncate_pagecache() to invalidate read cache
    ceph: SetPageError() for writeback pages if writepages fails
    ceph: handle interrupted ceph_writepage()
    ceph: make ceph_update_writeable_page() uninterruptible
    libceph: make ceph_osdc_wait_request() uninterruptible
    ceph: handle -EAGAIN returned by ceph_update_writeable_page()
    ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
    ceph: block non-fatal signals for fault/page_mkwrite
    ceph: make logical calculation functions return bool
    ceph: tolerate bad i_size for symlink inode
    ceph: improve fragtree change detection
    ceph: keep leaf frag when updating fragtree
    ceph: fix dir_auth check in ceph_fill_dirfrag()
    ceph: don't assume frag tree splits in mds reply are sorted
    ceph: fix inode reference leak
    ceph: using hash value to compose dentry offset
    ceph: don't forbid marking directory complete after forward seek
    ceph: record 'offset' for each entry of readdir result
    ceph: define 'end/complete' in readdir reply as bit flags
    ...

    Linus Torvalds
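
    The hash-based readdir offsets described above can be illustrated by packing a name hash and a per-hash index into one 64-bit offset. This is a sketch under assumptions: the 28-bit index width and the helper names make_dir_offset(), dir_offset_hash() and dir_offset_index() are hypothetical and do not claim to match the fs/ceph encoding.

```c
#include <stdint.h>

#define INDEX_BITS 28 /* assumed width of the per-hash index */
#define INDEX_MASK ((UINT64_C(1) << INDEX_BITS) - 1)

/* Pack a name hash (high bits) and an index (low bits) into one offset. */
static uint64_t make_dir_offset(uint32_t name_hash, uint32_t index)
{
    return ((uint64_t)name_hash << INDEX_BITS) | (index & INDEX_MASK);
}

/* Recover the hash part of an offset. */
static uint32_t dir_offset_hash(uint64_t off)
{
    return (uint32_t)(off >> INDEX_BITS);
}

/* Recover the index part of an offset. */
static uint32_t dir_offset_index(uint64_t off)
{
    return (uint32_t)(off & INDEX_MASK);
}
```

    Because the offset depends on the entry's hash rather than on its position inside a particular directory fragment, it stays the same when fragments split or merge.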
     
  • When btrfs_copy_from_user isn't able to copy all of the pages, we need
    to adjust our accounting to reflect the work that was actually done.

    Commit 2e78c927d79 changed around the decisions a little and we ended up
    skipping the accounting adjustments some of the time. This commit makes
    sure that when we don't copy anything at all, we still hop into
    the adjustments, and switches to release_bytes instead of write_bytes,
    since write_bytes isn't aligned.

    The accounting errors led to warnings during btrfs_destroy_inode:

    [ 70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0
    [ 70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk
    [ 70.847538] CPU: 10 PID: 514 Comm: umount Tainted: G W 4.6.0-rc6_00062_g2997da1-dirty #23
    [ 70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
    [ 70.847542] 0000000000000000 ffff880ff5cafab8 ffffffff8149d5e9 0000000000000202
    [ 70.847543] 0000000000000000 0000000000000000 0000000000000000 ffff880ff5cafb08
    [ 70.847547] ffffffff8107bdfd ffff880ff5cafaf8 000024868120013d ffff880ff5cafb28
    [ 70.847547] Call Trace:
    [ 70.847550] dump_stack+0x51/0x78
    [ 70.847551] __warn+0xfd/0x120
    [ 70.847553] warn_slowpath_null+0x1d/0x20
    [ 70.847555] btrfs_destroy_inode+0x2b3/0x2c0
    [ 70.847556] ? __destroy_inode+0x71/0x140
    [ 70.847558] destroy_inode+0x43/0x70
    [ 70.847559] ? wake_up_bit+0x2f/0x40
    [ 70.847560] evict+0x148/0x1d0
    [ 70.847562] ? start_transaction+0x3de/0x460
    [ 70.847564] dispose_list+0x59/0x80
    [ 70.847565] evict_inodes+0x180/0x190
    [ 70.847566] ? __sync_filesystem+0x3f/0x50
    [ 70.847568] generic_shutdown_super+0x48/0x100
    [ 70.847569] ? woken_wake_function+0x20/0x20
    [ 70.847571] kill_anon_super+0x16/0x30
    [ 70.847573] btrfs_kill_super+0x1e/0x130
    [ 70.847574] deactivate_locked_super+0x4e/0x90
    [ 70.847576] deactivate_super+0x51/0x70
    [ 70.847577] cleanup_mnt+0x3f/0x80
    [ 70.847579] __cleanup_mnt+0x12/0x20
    [ 70.847581] task_work_run+0x68/0xa0
    [ 70.847582] exit_to_usermode_loop+0xd6/0xe0
    [ 70.847583] do_syscall_64+0xbd/0x170
    [ 70.847586] entry_SYSCALL64_slow_path+0x25/0x25

    This is the test program I used to force short returns from
    btrfs_copy_from_user:

    /*
     * The commit message as shown has no headers or constants. The
     * includes below are what the calls need; the three values are
     * illustrative assumptions, not the ones used in the original test.
     */
    #define _GNU_SOURCE /* for sync_file_range() */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BUFSIZE (1024 * 1024)          /* assumed */
    #define ROUNDS 128                     /* assumed */
    #define MAXSIZE (1024UL * 1024 * 1024) /* assumed */

    /* Keep dropping the first quarter of the buffer so writes fault. */
    void *dontneed(void *arg)
    {
        char *p = arg;
        int ret;

        while (1) {
            ret = madvise(p, BUFSIZE / 4, MADV_DONTNEED);
            if (ret) {
                perror("madvise");
                exit(1);
            }
        }
    }

    int main(int ac, char **av)
    {
        int ret;
        int fd;
        char *filename;
        unsigned long offset;
        char *buf;
        int i;
        pthread_t tid;

        if (ac != 2) {
            fprintf(stderr, "usage: dammitdave filename\n");
            exit(1);
        }

        buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        memset(buf, 'a', BUFSIZE);
        filename = av[1];

        /* Race madvise() against the writes below. */
        ret = pthread_create(&tid, NULL, dontneed, buf);
        if (ret) {
            fprintf(stderr, "error %d from pthread_create\n", ret);
            exit(1);
        }

        ret = pthread_detach(tid);
        if (ret) {
            fprintf(stderr, "pthread detach failed %d\n", ret);
            exit(1);
        }

        while (1) {
            fd = open(filename, O_RDWR | O_CREAT, 0600);
            if (fd < 0) {
                perror("open");
                exit(1);
            }

            for (i = 0; i < ROUNDS; i++) {
                int this_write = BUFSIZE;

                /* Write the whole buffer at a random offset. */
                offset = rand() % MAXSIZE;
                ret = pwrite(fd, buf, this_write, offset);
                if (ret < 0) {
                    perror("pwrite");
                    exit(1);
                } else if (ret != this_write) {
                    fprintf(stderr, "short write to %s offset %lu ret %d\n",
                            filename, offset, ret);
                    exit(1);
                }
                /* Start writeback on the last round only. */
                if (i == ROUNDS - 1) {
                    ret = sync_file_range(fd, offset, 4096,
                                          SYNC_FILE_RANGE_WRITE);
                    if (ret < 0) {
                        perror("sync_file_range");
                        exit(1);
                    }
                }
            }
            ret = ftruncate(fd, 0);
            if (ret < 0) {
                perror("ftruncate");
                exit(1);
            }
            ret = close(fd);
            if (ret) {
                perror("close");
                exit(1);
            }
            ret = unlink(filename);
            if (ret) {
                perror("unlink");
                exit(1);
            }
        }
        return 0;
    }

    Signed-off-by: Chris Mason
    Reported-by: Dave Jones
    Fixes: 2e78c927d79333f299a8ac81c2fd2952caeef335
    cc: stable@vger.kernel.org # v4.6
    Signed-off-by: Chris Mason

    Chris Mason
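
    The accounting adjustment described in this commit can be illustrated with a small calculation. This is a sketch under assumptions, not the btrfs code: bytes_to_release() and its helpers are hypothetical, and the model simply assumes the reservation covers the sector-aligned range of the intended write while only the sector-aligned range actually copied stays accounted.

```c
#include <stdint.h>

/* Round x down to a multiple of align (align must be non-zero). */
static uint64_t round_down_to(uint64_t x, uint64_t align)
{
    return x - (x % align);
}

/* Round x up to a multiple of align. */
static uint64_t round_up_to(uint64_t x, uint64_t align)
{
    return round_down_to(x + align - 1, align);
}

/*
 * Bytes of reservation to hand back after a possibly short copy.
 * When nothing was copied, the whole aligned reservation is returned.
 */
static uint64_t bytes_to_release(uint64_t pos, uint64_t write_bytes,
                                 uint64_t copied, uint64_t sectorsize)
{
    uint64_t start = round_down_to(pos, sectorsize);
    uint64_t reserved = round_up_to(pos + write_bytes, sectorsize) - start;
    uint64_t dirtied = copied ?
        round_up_to(pos + copied, sectorsize) - start : 0;

    return reserved - dirtied;
}
```

    The unaligned write_bytes value (8192 for a write of that size at offset 100) differs from the aligned amount reserved for it (12288 with 4096-byte sectors), which shows why an aligned quantity is needed for the adjustment.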
     
  • Pull NFS client updates from Anna Schumaker:
    "Highlights include:

    Features:
    - Add support for the NFS v4.2 COPY operation
    - Add support for NFS/RDMA over IPv6

    Bugfixes and cleanups:
    - Avoid race that crashes nfs_init_commit()
    - Fix oops in callback path
    - Fix LOCK/OPEN race when unlinking an open file
    - Choose correct stateids when using delegations in setattr, read and
    write
    - Don't send empty SETATTR after OPEN_CREATE
    - xprtrdma: Prevent server from writing a reply into memory client
    has released
    - xprtrdma: Support using Read list and Reply chunk in one RPC call"

    * tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits)
    pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
    nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
    nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
    nfs: avoid race that crashes nfs_init_commit
    NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()
    pnfs: make pnfs_layout_process more robust
    pnfs: rework LAYOUTGET retry handling
    pnfs: lift retry logic from send_layoutget to pnfs_update_layout
    pnfs: fix bad error handling in send_layoutget
    flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
    flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
    pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
    pnfs: keep track of the return sequence number in pnfs_layout_hdr
    pnfs: record sequence in pnfs_layout_segment when it's created
    pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set
    pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes
    pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io
    pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
    NFS: Reclaim writes via writepage are opportunistic
    NFSv4: Use the right stateid for delegations in setattr, read and write
    ...

    Linus Torvalds
     
  • Pull xfs updates from Dave Chinner:
    "A pretty average collection of fixes, cleanups and improvements in
    this request.

    Summary:
    - fixes for mount line parsing, sparse warnings, read-only compat
    feature remount behaviour
    - allow fast path symlink lookups for inline symlinks.
    - attribute listing cleanups
    - writeback goes direct to bios rather than indirecting through
    bufferheads
    - transaction allocation cleanup
    - optimised kmem_realloc
    - added configurable error handling for metadata write errors,
    changed default error handling behaviour from "retry forever" to
    "retry until unmount then fail"
    - fixed several inode cluster writeback lookup vs reclaim race
    conditions
    - fixed inode cluster writeback checking wrong inode after lookup
    - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
    - cleaned up inode reclaim tagging"

    * tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
    xfs: fix warning in xfs_finish_page_writeback for non-debug builds
    xfs: move reclaim tagging functions
    xfs: simplify inode reclaim tagging interfaces
    xfs: rename variables in xfs_iflush_cluster for clarity
    xfs: xfs_iflush_cluster has range issues
    xfs: mark reclaimed inodes invalid earlier
    xfs: xfs_inode_free() isn't RCU safe
    xfs: optimise xfs_iext_destroy
    xfs: skip stale inodes in xfs_iflush_cluster
    xfs: fix inode validity check in xfs_iflush_cluster
    xfs: xfs_iflush_cluster fails to abort on error
    xfs: remove xfs_fs_evict_inode()
    xfs: add "fail at unmount" error handling configuration
    xfs: add configuration handlers for specific errors
    xfs: add configuration of error failure speed
    xfs: introduce table-based init for error behaviors
    xfs: add configurable error support to metadata buffers
    xfs: introduce metadata IO error class
    xfs: configurable error behavior via sysfs
    xfs: buffer ->bi_end_io function requires irq-safe lock
    ...

    Linus Torvalds
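
    The change in default metadata write error handling, from "retry forever" to "retry until unmount then fail", can be sketched as a small policy decision. The struct error_policy type, its fields and should_retry() are hypothetical names chosen for illustration; they are not the XFS implementation or its sysfs interface.

```c
#include <stdbool.h>

/* Hypothetical policy knobs for a failed metadata write. */
struct error_policy {
    int max_retries;      /* negative means no retry limit (assumed) */
    bool fail_at_unmount; /* give up once an unmount is in progress */
};

/* Decide whether a failed write should be retried once more. */
static bool should_retry(const struct error_policy *p, int retries_so_far,
                         bool unmounting)
{
    if (unmounting && p->fail_at_unmount)
        return false;
    if (p->max_retries < 0)
        return true;
    return retries_so_far < p->max_retries;
}
```

    With { -1, true } this models the new default, where retries continue until unmount, and { -1, false } models the old "retry forever" behaviour, which could leave an unmount unable to complete.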
     

26 May, 2016

3 commits