11 Sep, 2014

1 commit

  • Pull UDF fixes from Jan Kara:
    "Fixes for UDF handling of NFS handles and one fix for proper handling
    of corrupted media"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    udf: saner calling conventions for udf_new_inode()
    udf: fix the udf_iget() vs. udf_new_inode() races
    udf: merge the pieces inserting a new non-directory object into directory
    udf: Set i_generation field
    udf: Properly detect stale inodes
    udf: Make udf_read_inode() and udf_iget() return error
    udf: Avoid infinite loop when processing indirect ICBs
    udf: Fold udf_fill_inode() into __udf_read_inode()
    udf: Avoid dir link count to go negative

    Linus Torvalds
     

10 Sep, 2014

1 commit

  • Pull cifs/smb3 fixes from Steve French:
    "This includes various cifs and smb3 bug fixes including those for bugs
    found with the recently updated xfstests.

    Also I am working on fixes for two additional cifs problems found by
    xfstests which I plan to send later (when reviewed and run additional
    tests)"

    * 'for-next-3.17' of git://git.samba.org/sfrench/cifs-2.6:
    Clarify Kconfig help text for CIFS and SMB2/SMB3
    CIFS: Fix wrong filename length for SMB2
    CIFS: Fix wrong restart readdir for SMB1
    CIFS: Fix directory rename error
    cifs: No need to send SIGKILL to demux_thread during umount
    cifs: Allow directIO read/write during cache=strict
    cifs: remove unneeded check of null checking in if condition
    cifs: fix a possible use of uninit variable in SMB2_sess_setup
    cifs: fix memory leak when password is supplied multiple times
    cifs: fix a possible null pointer deref in decode_ascii_ssetup
    Trivial whitespace fix

    Linus Torvalds
     

09 Sep, 2014

4 commits

  • Pull ext4 bugfix from Ted Ts'o.

    [ Hmm. It's possible we should make kfree() aware of error pointers,
    and use IS_ERR_OR_NULL rather than a NULL check. But in the meantime
    this is obviously the right fix. - Linus ]

    * 'for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: avoid trying to kfree an ERR_PTR pointer

    Linus Torvalds
     
  • Pull nfsd bugfixes from Bruce Fields:
    "A couple minor nfsd bugfixes"

    * 'for-3.17' of git://linux-nfs.org/~bfields/linux:
    lockd: fix rpcbind crash on lockd startup failure
    nfsd4: fix rd_dircount enforcement

    Linus Torvalds
     
  • Nikita Yushchenko reported that booting a kernel with init=/bin/sh and
    then nfs mounting without portmap or rpcbind running using a busybox
    mount resulted in:

    # mount -t nfs 10.30.130.21:/opt /mnt
    svc: failed to register lockdv1 RPC service (errno 111).
    lockd_up: makesock failed, error=-111
    Unable to handle kernel paging request for data at address 0x00000030
    Faulting instruction address: 0xc055e65c
    Oops: Kernel access of bad area, sig: 11 [#1]
    MPC85xx CDS
    Modules linked in:
    CPU: 0 PID: 1338 Comm: mount Not tainted 3.10.44.cge #117
    task: cf29cea0 ti: cf35c000 task.ti: cf35c000
    NIP: c055e65c LR: c0566490 CTR: c055e648
    REGS: cf35dad0 TRAP: 0300 Not tainted (3.10.44.cge)
    MSR: 00029000 CR: 22442488 XER: 20000000
    DEAR: 00000030, ESR: 00000000

    GPR00: c05606f4 cf35db80 cf29cea0 cf0ded80 cf0dedb8 00000001 1dec3086 00000000
    GPR08: 00000000 c07b1640 00000007 1dec3086 22442482 100b9758 00000000 10090ae8
    GPR16: 00000000 000186a5 00000000 00000000 100c3018 bfa46edc 100b0000 bfa46ef0
    GPR24: cf386ae0 c07834f0 00000000 c0565f88 00000001 cf0dedb8 00000000 cf0ded80
    NIP [c055e65c] call_start+0x14/0x34
    LR [c0566490] __rpc_execute+0x70/0x250
    Call Trace:
    [cf35db80] [00000080] 0x80 (unreliable)
    [cf35dbb0] [c05606f4] rpc_run_task+0x9c/0xc4
    [cf35dbc0] [c0560840] rpc_call_sync+0x50/0xb8
    [cf35dbf0] [c056ee90] rpcb_register_call+0x54/0x84
    [cf35dc10] [c056f24c] rpcb_register+0xf8/0x10c
    [cf35dc70] [c0569e18] svc_unregister.isra.23+0x100/0x108
    [cf35dc90] [c0569e38] svc_rpcb_cleanup+0x18/0x30
    [cf35dca0] [c0198c5c] lockd_up+0x1dc/0x2e0
    [cf35dcd0] [c0195348] nlmclnt_init+0x2c/0xc8
    [cf35dcf0] [c015bb5c] nfs_start_lockd+0x98/0xec
    [cf35dd20] [c015ce6c] nfs_create_server+0x1e8/0x3f4
    [cf35dd90] [c0171590] nfs3_create_server+0x10/0x44
    [cf35dda0] [c016528c] nfs_try_mount+0x158/0x1e4
    [cf35de20] [c01670d0] nfs_fs_mount+0x434/0x8c8
    [cf35de70] [c00cd3bc] mount_fs+0x20/0xbc
    [cf35de90] [c00e4f88] vfs_kern_mount+0x50/0x104
    [cf35dec0] [c00e6e0c] do_mount+0x1d0/0x8e0
    [cf35df10] [c00e75ac] SyS_mount+0x90/0xd0
    [cf35df40] [c000ccf4] ret_from_syscall+0x0/0x3c

    The addition of svc_shutdown_net() resulted in two calls to
    svc_rpcb_cleanup(); the second is no longer necessary and crashes when
    it calls rpcb_register_call with clnt=NULL.

    Reported-by: Nikita Yushchenko
    Fixes: 679b033df484 "lockd: ensure we tear down any live sockets when socket creation fails during lockd_up"
    Cc: stable@vger.kernel.org
    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • Commit 3b299709091b "nfsd4: enforce rd_dircount" totally misunderstood
    rd_dircount; it refers to total non-attribute bytes returned, not number
    of directory entries returned.

    Bring the code into agreement with RFC 3530 section 14.2.24.

    Cc: stable@vger.kernel.org
    Fixes: 3b299709091b "nfsd4: enforce rd_dircount"
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
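
    The byte-based reading of rd_dircount described above can be sketched
    as a small userspace model. This is not the nfsd code: the function
    names are hypothetical, and the per-entry cost (XDR-padded name plus an
    8-byte cookie) is an assumption made for illustration.

```python
XDR_UNIT = 4  # XDR encodes data in 4-byte units


def xdr_padded(nbytes):
    """Round a byte count up to the next multiple of the XDR unit."""
    return (nbytes + XDR_UNIT - 1) // XDR_UNIT * XDR_UNIT


def entries_within_dircount(names, dircount, cookie_size=8):
    """Return the leading entries whose name+cookie bytes fit in dircount.

    Hypothetical model: dircount limits bytes, not the number of entries.
    The first entry is always returned so that a reader can make progress.
    """
    returned = []
    remaining = dircount
    for name in names:
        cost = xdr_padded(len(name.encode())) + cookie_size
        if returned and cost > remaining:
            break
        returned.append(name)
        remaining -= min(remaining, cost)
    return returned
```

    With a limit of 24, only the first two of the names "a", "bb" and
    "ccccc" fit, whereas an entry-count reading would have returned all
    three.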
     

08 Sep, 2014

2 commits

  • Pull filesystem fixes from Al Viro:
    "Several bugfixes (all of them -stable fodder).

    Alexey's one deals with double mutex_lock() in UFS (apparently, nobody
    has tried to test "ufs: sb mutex merge + mutex_destroy" on something
    like file creation/removal on ufs). Mine deal with two kinds of
    umount bugs, in umount propagation and in handling of automounted
    submounts, both resulting in bogus transient EBUSY from umount"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    ufs: fix deadlocks introduced by sb mutex merge
    fix EBUSY on umount() from MNT_SHRINKABLE
    get rid of propagate_umount() mistakenly treating slaves as busy.

    Linus Torvalds
     
  • Commit 0244756edc4b ("ufs: sb mutex merge + mutex_destroy") introduces
    deadlocks in ufs_new_inode() and ufs_free_inode().
    Most callers of those functions acquire the mutex themselves, and
    ufs_{new,free}_inode() take it again via lock_ufs(),
    i.e. we have an unavoidable double lock.

    The patch proposes to resolve the issue by making sure that
    ufs_{new,free}_inode() are not called with the mutex held.

    Found by Linux Driver Verification project (linuxtesting.org).

    Cc: stable@vger.kernel.org # 3.16
    Signed-off-by: Alexey Khoroshilov
    Signed-off-by: Al Viro

    Alexey Khoroshilov
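
    The double-lock pattern can be illustrated with a non-recursive lock
    in userspace. This is only a sketch: the function names are
    hypothetical stand-ins, and a non-blocking acquire is used so the
    example reports the problem instead of hanging.

```python
import threading

sb_mutex = threading.Lock()  # non-recursive, like a kernel mutex


def new_inode_locked_internally():
    """Stand-in for a helper that takes the superblock mutex itself."""
    # A blocking acquire would hang forever if the caller already holds
    # the mutex, so the sketch probes with a non-blocking attempt.
    if not sb_mutex.acquire(blocking=False):
        return "would deadlock"
    try:
        return "ok"
    finally:
        sb_mutex.release()


def create_before_fix():
    """Caller holds the mutex across the call: the double-lock pattern."""
    with sb_mutex:
        return new_inode_locked_internally()


def create_after_fix():
    """Caller no longer holds the mutex when it calls the helper."""
    return new_inode_locked_internally()
```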
     

07 Sep, 2014

1 commit

  • Pull xfs fixes from Dave Chinner:
    "The fixes all address recently discovered data corruption issues.

    The original Direct IO issue was discovered by Chris Mason @ Facebook
    on a production workload which mixed buffered reads with direct reads
    and writes IO to the same file. The fix for that exposed other issues
    with page invalidation (exposed by millions of fsx operations) failing
    due to dirty buffers beyond EOF.

    Finally, the collapse_range code could also cause problems due to
    racing writeback changing the extent map while it was being shifted
    around. The commits for that problem are simple mitigation fixes that
    prevent the problem from occurring. A more robust fix for 3.18 that
    addresses the underlying problem is currently being worked on by
    Brian.

    Summary of fixes:
    - a direct IO read/buffered read data corruption
    - the associated fallout from the DIO data corruption fix
    - collapse range bugs that are potential data corruption issues"

    * tag 'xfs-for-linus-3.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
    xfs: trim eofblocks before collapse range
    xfs: xfs_file_collapse_range is delalloc challenged
    xfs: don't log inode unless extent shift makes extent modifications
    xfs: use ranged writeback and invalidation for direct IO
    xfs: don't zero partial page cache pages during O_DIRECT writes
    xfs: don't zero partial page cache pages during O_DIRECT writes
    xfs: don't dirty buffers beyond EOF

    Linus Torvalds
     

05 Sep, 2014

9 commits

  • This patch changes sync_filesystem() to be EXPORT_SYMBOL().

    The reason this is needed is that starting with 3.15 kernel, due to
    Theodore Ts'o's commit 02b9984d6408 ("fs: push sync_filesystem() down to
    the file system's remount_fs()"), all file systems that have dirty data
    to be written out need to call sync_filesystem() from their
    ->remount_fs() method when remounting read-only.

    As this is now a generically required function rather than an internal
    only function it should be EXPORT_SYMBOL() so that all file systems can
    call it.

    Signed-off-by: Anton Altaparmakov
    Acked-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
     
  • Pull aio bugfixes from Ben LaHaise:
    "Two small fixes"

    * git://git.kvack.org/~bcrl/aio-fixes:
    aio: block exit_aio() until all context requests are completed
    aio: add missing smp_rmb() in read_events_ring

    Linus Torvalds
     
  • It seems that exit_aio() also needs to wait for all iocbs to complete (like
    io_destroy), but we missed the wait step in the current implementation, so
    fix it in the same way as we did in io_destroy.

    Signed-off-by: Gu Zheng
    Signed-off-by: Benjamin LaHaise
    Cc: stable@vger.kernel.org

    Gu Zheng
     
  • Signed-off-by: Al Viro
    Signed-off-by: Jan Kara

    Al Viro
     
  • Currently udf_iget() (triggered by NFS) can race with udf_new_inode()
    leading to two inode structures with the same inode number:

    nfsd: iget_locked() creates inode
    nfsd: try to read from disk, block on that.
    udf_new_inode(): allocate inode with that inumber
    udf_new_inode(): insert it into icache, set it up and dirty
    udf_write_inode(): write inode into buffer cache
    nfsd: get CPU again, look into buffer cache, see nice and sane on-disk
    inode, set the in-core inode from it

    Fix the problem by putting inode into icache in locked state (I_NEW set)
    and unlocking it only after it's fully set up.

    Signed-off-by: Al Viro
    Signed-off-by: Jan Kara

    Al Viro
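
    The idea of inserting an inode into the cache in a locked state can
    be modelled in userspace roughly as follows. The class and method
    names are hypothetical, and a threading.Event stands in for the I_NEW
    flag; this is a sketch of the pattern, not of the UDF code.

```python
import threading


class Inode:
    def __init__(self, ino):
        self.ino = ino
        self.ready = threading.Event()  # not set == still "new" (locked)
        self.owner = None


class InodeCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._inodes = {}

    def get_locked(self, ino):
        """Return (inode, is_new); a new inode is inserted still locked."""
        with self._lock:
            inode = self._inodes.get(ino)
            if inode is None:
                inode = Inode(ino)
                self._inodes[ino] = inode
                return inode, True
        # Someone else created it: block until its setup is finished.
        inode.ready.wait()
        return inode, False

    def unlock_new(self, inode):
        """Publish a fully set up inode to any waiting lookups."""
        inode.ready.set()
```

    A second lookup of the same number then waits for the creator rather
    than building its own in-core inode from whatever is on disk.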
     
  • boilerplate code in udf_{create,mknod,symlink} taken to new helper

    symlink case converted to unique id calculated by udf_new_inode() - no
    point finding a new one.

    Signed-off-by: Al Viro
    Signed-off-by: Jan Kara

    Al Viro
     
  • Currently UDF doesn't initialize i_generation in any way and thus NFS
    can easily get reallocated inodes from stale file handles. Luckily UDF
    already has a unique object identifier associated with each inode -
    i_unique. Use that for initialization of i_generation.

    Signed-off-by: Jan Kara

    Jan Kara
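
    Why a generation number helps with stale handles can be shown with a
    small model. The names below are hypothetical, and truncating the
    unique id to 32 bits is an assumption of this sketch.

```python
import errno


class Inode:
    def __init__(self, ino, unique_id):
        self.ino = ino
        # Model of the fix: seed the generation from the per-object unique
        # id (assumed here to be truncated to 32 bits).
        self.generation = unique_id & 0xFFFFFFFF


def lookup_handle(icache, ino, generation):
    """Resolve an (ino, generation) file handle, or fail with ESTALE."""
    inode = icache.get(ino)
    if inode is None or inode.generation != generation:
        raise OSError(errno.ESTALE, "stale file handle")
    return inode
```

    If inode number 5 is freed and reallocated with a new unique id, an
    old handle carrying the previous generation no longer matches and is
    rejected instead of silently resolving to the new object.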
     
  • NFS can easily ask for inodes that are already deleted. Currently UDF
    happily returns such inodes which is a bug. Return -ESTALE if
    udf_read_inode() is asked to read deleted inode.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Currently __udf_read_inode() wasn't returning anything and we found out
    whether we succeeded reading inode by checking whether inode is bad or
    not. udf_iget() returned NULL on failure and inode pointer otherwise.
    Make these two functions properly propagate errors up the call stack and
    use the return value in callers.

    Signed-off-by: Jan Kara

    Jan Kara
     

04 Sep, 2014

4 commits

  • We did not implement any bound on number of indirect ICBs we follow when
    loading inode. Thus corrupted medium could cause kernel to go into an
    infinite loop, possibly causing a stack overflow.

    Fix the possible stack overflow by removing recursion from
    __udf_read_inode() and limit number of indirect ICBs we follow to avoid
    infinite loops.

    Signed-off-by: Jan Kara

    Jan Kara
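
    Replacing recursion with a bounded loop can be sketched as below.
    The bound of 8 and all names are hypothetical; the commit message
    does not state the actual limit.

```python
MAX_INDIRECT_ICBS = 8  # hypothetical bound; the commit does not give a value


def resolve_inode_block(indirect, start):
    """Follow indirect ICB links iteratively, giving up after a fixed bound.

    `indirect` maps a block to the block its indirect entry points at;
    blocks missing from the map are taken to hold the final file entry.
    """
    block = start
    for _ in range(MAX_INDIRECT_ICBS + 1):
        if block not in indirect:
            return block
        block = indirect[block]
    raise OSError("too many indirect ICBs (corrupted medium?)")
```

    A cycle such as 1 -> 2 -> 1 now ends in an error after a fixed number
    of steps, and no stack grows while the links are followed.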
     
  • There's no good reason to separate these since udf_fill_inode() is
    called only from __udf_read_inode() and both do part of the same thing.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • If we are writing back inode of unlinked directory, its link count ends
    up being (u16)-1. Although the inode is deleted, udf_iget() can load the
    inode when NFS uses stale file handle and get confused.

    Signed-off-by: Jan Kara

    Jan Kara
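
    The (u16)-1 value is plain 16-bit wraparound. The sketch below
    assumes, for illustration only, that the on-disk count for a
    directory is the in-core link count minus one; the helper names are
    hypothetical.

```python
def u16(value):
    """Truncate to 16 bits the way a C uint16_t assignment would."""
    return value & 0xFFFF


def on_disk_link_count(i_nlink, is_dir):
    """Hypothetical model of the link count written back to disk.

    Assumes one link is subtracted for directories; the `i_nlink > 0`
    guard models avoiding the wraparound for an unlinked directory.
    """
    if is_dir and i_nlink > 0:
        return u16(i_nlink - 1)
    return u16(i_nlink)
```

    Without the guard, an unlinked directory with a link count of 0 would
    be written back as u16(0 - 1), which is 65535.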
     
  • Pull f2fs bug fixes from Jaegeuk Kim:
    "This series includes patches to:

    - fix recovery routines
    - fix bugs related to inline_data/xattr
    - fix when casting the dentry names
    - handle EIO or ENOMEM correctly
    - fix memory leak
    - fix lock coverage"

    * tag 'for-f2fs-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (28 commits)
    f2fs: reposition unlock_new_inode to prevent accessing invalid inode
    f2fs: fix wrong casting for dentry name
    f2fs: simplify by using a literal
    f2fs: truncate stale block for inline_data
    f2fs: use macro for code readability
    f2fs: introduce need_do_checkpoint for readability
    f2fs: fix incorrect calculation with total/free inode num
    f2fs: remove rename and use rename2
    f2fs: skip if inline_data was converted already
    f2fs: remove rewrite_node_page
    f2fs: avoid double lock in truncate_blocks
    f2fs: prevent checkpoint during roll-forward
    f2fs: add WARN_ON in f2fs_bug_on
    f2fs: handle EIO not to break fs consistency
    f2fs: check s_dirty under cp_mutex
    f2fs: unlock_page when node page is redirtied out
    f2fs: introduce f2fs_cp_error for readability
    f2fs: give a chance to mount again when encountering errors
    f2fs: trigger release_dirty_inode in f2fs_put_super
    f2fs: don't skip checkpoint if there is no dirty node pages
    ...

    Linus Torvalds
     

03 Sep, 2014

2 commits

  • Thanks to Dan Carpenter for extending smatch to find bugs like this.
    (This was found using a development version of smatch.)

    Fixes: 36de928641ee48b2078d3fe9514242aaa2f92013
    Reported-by: Dan Carpenter
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     
  • We ran into a case on ppc64 running mariadb where io_getevents would
    return zeroed out I/O events. After adding instrumentation, it became
    clear that there was some missing synchronization between reading the
    tail pointer and the events themselves. This small patch fixes the
    problem in testing.

    Thanks to Zach for helping to look into this, and suggesting the fix.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Benjamin LaHaise
    Cc: stable@vger.kernel.org

    Jeff Moyer
     

02 Sep, 2014

8 commits

  • Because of a race on the inode cache, the following scenario can occur:

    [Thread a]                          [Thread b]
    ->f2fs_mkdir
    ->f2fs_add_link
    ->__f2fs_add_link
    ->init_inode_metadata failed here
                                        ->gc_thread_func
                                        ->f2fs_gc
                                        ->do_garbage_collect
                                        ->gc_data_segment
                                        ->f2fs_iget
                                        ->iget_locked
                                        ->wait_on_inode
    ->unlock_new_inode
                                        ->move_data_page
    ->make_bad_inode
    ->iput

    When create/symlink/mkdir/mknod/tmpfile fails, the newly allocated inode
    should be marked bad so that other threads cannot access it. But in the
    scenario above, f2fs can access the invalid inode before it has been
    marked bad.
    This patch fixes the potential problem, which was found by code review.

    change log from v1:
    o Add condition judgment in gc_data_segment() suggested by Changman Lee.
    o use iget_failed to simplify code.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • xfs_collapse_file_space() currently writes back the entire file
    undergoing collapse range to settle things down for the extent shift
    algorithm. While this prevents changes to the extent list during the
    collapse operation, the writeback itself is not enough to prevent
    unnecessary collapse failures.

    The current shift algorithm uses the extent index to iterate the in-core
    extent list. If a post-eof delalloc extent persists after the writeback
    (e.g., a prior zero range op where the end of the range aligns with eof
    can separate the post-eof blocks such that they are not written back and
    converted), xfs_bmap_shift_extents() becomes confused over the encoded
    br_startblock value and fails the collapse.

    As with the full writeback, this is a temporary fix until the algorithm
    is improved to cope with a volatile extent list and avoid attempts to
    shift post-eof extents.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • If we have delalloc extents on a file before we run a collapse range
    operation, we sync the range that we are going to collapse to
    convert delalloc extents in that region to real extents to simplify
    the shift operation.

    However, the shift operation then assumes that the extent list is
    not going to change as it iterates over the extent list moving
    things about. Unfortunately, this isn't true because we can't hold
    the ILOCK over all the operations. We can prevent new IO from
    modifying the extent list by holding the IOLOCK, but that doesn't
    prevent writeback from running....

    And when writeback runs, it can convert delalloc extents in the
    range of the file prior to the region being collapsed, and this
    changes the indexes of all the extents in the file. That causes the
    collapse range operation to Go Bad.

    The right fix is to rewrite the extent shift operation not to be
    dependent on the extent list not changing across the entire
    operation, but this is a fairly significant piece of work to do.
    Hence, as a short-term workaround for the problem, sync the entire
    file before starting a collapse operation to remove all delalloc
    ranges from the file and so avoid the problem of concurrent
    writeback changing the extent list.

    Diagnosed-and-Reported-by: Brian Foster
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
    all subsequent extents down into the specified, previously punched out,
    region. This function performs some validation, such as whether a
    sufficient hole exists in the target region of the collapse, then shifts
    the remaining extents downward.

    The exit path of the function currently logs the inode unconditionally.
    While we must log the inode (and abort) if an error occurs and the
    transaction is dirty, the initial validation paths can generate errors
    before the transaction has been dirtied. This creates an unnecessary
    filesystem shutdown scenario, as the caller will cancel a transaction
    that has been marked dirty.

    Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
    are made to the inode bmap. Only log the inode in the exit path if
    logflags has been set. This ensures we only have to cancel a dirty
    transaction if modifications have been made and prevents an unnecessary
    filesystem shutdown otherwise.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
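
    The "accumulate logflags, log only if set" pattern can be sketched
    like this. The flag values, the list used as a log and the function
    name are hypothetical; real extent shifting is far more involved.

```python
LOG_CORE = 0x1  # hypothetical flag values, for illustration only
LOG_DEXT = 0x2


def shift_extents(extents, hole_ok, log):
    """Shift extents down by one slot, logging only if something changed."""
    logflags = 0
    error = None
    if not hole_ok:
        # Validation failed before any modification: nothing to log.
        error = "EINVAL"
    else:
        for i in range(len(extents)):
            extents[i] -= 1
            logflags |= LOG_CORE | LOG_DEXT
    # Exit path: the inode is logged only when a modification was made.
    if logflags:
        log.append(logflags)
    return error
```

    An early validation failure therefore leaves the log untouched, so
    the caller is cancelling a clean transaction rather than a dirty one.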
     
  • Now we are not doing silly things with dirtying buffers beyond EOF
    and using invalidation correctly, we can finally reduce the ranges of
    writeback and invalidation used by direct IO to match that of the IO
    being issued.

    Bring the writeback and invalidation ranges back to match the
    generic direct IO code - this will greatly reduce the perturbation
    of cached data when direct IO and buffered IO are mixed, but still
    provide the same buffered vs direct IO coherency behaviour we
    currently have.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Similar to direct IO reads, direct IO writes are using
    truncate_pagecache_range to invalidate the page cache. This is
    incorrect due to the sub-block zeroing in the page cache that
    truncate_pagecache_range() triggers.

    This patch fixes things by using invalidate_inode_pages2_range
    instead. It preserves the page cache invalidation, but won't zero
    any pages.

    cc: stable@vger.kernel.org
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • xfs is using truncate_pagecache_range to invalidate the page cache
    during DIO reads. This is different from the other filesystems, which
    only invalidate pages during DIO writes.

    truncate_pagecache_range is meant to be used when we are freeing the
    underlying data structs from disk, so it will zero any partial
    ranges in the page. This means a DIO read can zero out part of the
    page cache page, and it is possible the page will stay in cache.

    buffered reads will find an up to date page with zeros instead of
    the data actually on disk.

    This patch fixes things by using invalidate_inode_pages2_range
    instead. It preserves the page cache invalidation, but won't zero
    any pages.

    [dchinner: catch error and warn if it fails. Comment.]

    cc: stable@vger.kernel.org
    Signed-off-by: Chris Mason
    Reviewed-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Chris Mason
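
    The difference between truncate-style and invalidate-style handling
    of a partially covered page can be seen in a toy page cache. This is
    a userspace model with hypothetical names and byte-based ranges, not
    the kernel interfaces named above.

```python
PAGE_SIZE = 4096


class PageCache:
    """Toy page cache over a bytes 'disk'; pages are loaded on demand."""

    def __init__(self, disk):
        self.disk = disk
        self.pages = {}

    def read(self, offset, length):
        """Buffered read: served from cached pages, misses filled from disk."""
        out = bytearray()
        for pos in range(offset, offset + length):
            index = pos // PAGE_SIZE
            if index not in self.pages:
                start = index * PAGE_SIZE
                self.pages[index] = bytearray(self.disk[start:start + PAGE_SIZE])
            out.append(self.pages[index][pos % PAGE_SIZE])
        return bytes(out)

    def _overlapping(self, start, end):
        return [i for i in list(self.pages)
                if i * PAGE_SIZE <= end and (i + 1) * PAGE_SIZE - 1 >= start]

    def truncate_range(self, start, end):
        """Truncate-style: drop whole pages, zero partially covered ones."""
        for i in self._overlapping(start, end):
            lo = max(start, i * PAGE_SIZE)
            hi = min(end, (i + 1) * PAGE_SIZE - 1)
            if hi - lo + 1 == PAGE_SIZE:
                del self.pages[i]
            else:
                self.pages[i][lo % PAGE_SIZE:hi % PAGE_SIZE + 1] = bytes(hi - lo + 1)

    def invalidate_range(self, start, end):
        """Invalidate-style: drop every cached page in the range, no zeroing."""
        for i in self._overlapping(start, end):
            del self.pages[i]
```

    After a partial truncate_range() the page stays cached with zeroed
    bytes, so a later buffered read returns zeros that are not on disk.
    After invalidate_range() the page is gone and the read refills it
    from disk.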
     
  • generic/263 is failing fsx at this point with a page spanning
    EOF that cannot be invalidated. The operations are:

    1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
    1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
    1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)

    where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
    write attempts to invalidate the cached page over this range, it
    fails with -EBUSY and so any attempt to do page invalidation fails.

    The real question is this: Why can't that page be invalidated after
    it has been written to disk and cleaned?

    Well, there's data on the first two buffers in the page (1k block
    size, 4k page), but the third buffer on the page (i.e. beyond EOF)
    is failing drop_buffers because its bh->b_state == 0x3, which is
    BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
    what?

    OK, set_buffer_dirty() is called on all buffers from
    __set_page_dirty_buffers(), regardless of whether the buffer is
    beyond EOF or not, which means that when we get to ->writepage,
    we have buffers marked dirty beyond EOF that we need to clean.
    So, we need to implement our own .set_page_dirty method that
    doesn't dirty buffers beyond EOF.

    This is messy because the buffer code is not meant to be shared
    and it has interesting locking issues on the buffer dirty bits.
    So just copy and paste it and then modify it to suit what we need.

    Note: the solutions the other filesystems and generic block code use
    of marking the buffers clean in ->writepage does not work for XFS.
    It still leaves dirty buffers beyond EOF and invalidations still
    fail. Hence rather than play whack-a-mole, this patch simply
    prevents those buffers from being dirtied in the first place.

    cc:
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
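
    The rule of only dirtying buffers that start before EOF can be
    checked against the numbers in the message (1k blocks, 4k pages).
    The function below is a hypothetical userspace model of that rule.

```python
def dirty_buffers_within_eof(page_offset, file_size, block_size=1024, page_size=4096):
    """Return one dirty flag per buffer in the page, skipping buffers past EOF."""
    flags = []
    for buf_start in range(page_offset, page_offset + page_size, block_size):
        # A buffer is dirtied only if it starts before the end of the file.
        flags.append(buf_start < file_size)
    return flags
```

    For the page starting at 0x5e000 with the last file byte at 0x5e569
    (a file size of 0x5e56a), the first two buffers are dirtied and the
    remaining two, which lie beyond EOF, are left clean.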
     

31 Aug, 2014

3 commits

  • Pull file locking bugfix from Jeff Layton:
    "Just a bugfix for a bug that crept in to v3.15. It's in a rather rare
    error path, and I'm not aware of anyone having hit it, but it's worth
    fixing for v3.17"

    * tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux:
    locks: pass correct "before" pointer to locks_unlink_lock in generic_add_lease

    Linus Torvalds
     
  • We need the parents of victims alive until namespace_unlock() gets to
    dput() of the (ex-)mountpoints. However, that screws up the "is it
    busy" checks in case when we have shrinkable mounts that need to be
    killed. Solution: go ahead and decrement refcounts of parents right
    in umount_tree(), increment them again just before dropping rwsem in
    namespace_unlock() (and let the loop in the end of namespace_unlock()
    finally drop those references for good, as we do now). Parents can't
    get freed until we drop rwsem - at least one reference is kept until
    then, both in case when parent is among the victims and when it is
    not. So they'll still be around when we get to namespace_unlock().

    Cc: stable@vger.kernel.org # 3.12+
    Signed-off-by: Al Viro

    Al Viro
     
  • The check in __propagate_umount() ("has somebody explicitly mounted
    something on that slave?") is done *before* taking the already doomed
    victims out of the child lists.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

30 Aug, 2014

5 commits

  • Merge patches from Andrew Morton:
    "22 fixes"

    * emailed patches from Andrew Morton: (22 commits)
    kexec: purgatory: add clean-up for purgatory directory
    Documentation/kdump/kdump.txt: add ARM description
    flush_icache_range: export symbol to fix build errors
    tools: selftests: fix build issue with make kselftests target
    ocfs2: quorum: add a log for node not fenced
    ocfs2: o2net: set tcp user timeout to max value
    ocfs2: o2net: don't shutdown connection when idle timeout
    ocfs2: do not write error flag to user structure we cannot copy from/to
    x86/purgatory: use approprate -m64/-32 build flag for arch/x86/purgatory
    drivers/rtc/rtc-s5m.c: re-add support for devices without irq specified
    xattr: fix check for simultaneous glibc header inclusion
    kexec: remove CONFIG_KEXEC dependency on crypto
    kexec: create a new config option CONFIG_KEXEC_FILE for new syscall
    x86,mm: fix pte_special versus pte_numa
    hugetlb_cgroup: use lockdep_assert_held rather than spin_is_locked
    mm/zpool: use prefixed module loading
    zram: fix incorrect stat with failed_reads
    lib: turn CONFIG_STACKTRACE into an actual option.
    mm: actually clear pmd_numa before invalidating
    memblock, memhotplug: fix wrong type in memblock_find_in_range_node().
    ...

    Linus Torvalds
     
  • For debug use, we can see from the log whether the fence decision is
    made and why it is not fenced.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Srinivas Eeda
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • When the tcp retransmit timeout (15 mins) expires, the connection is
    closed and pending messages may be lost. So we set the tcp user timeout
    to its max value to override the retransmit timeout. This is OK for
    ocfs2 since we have the disk heartbeat: if the peer crashes, its disk
    heartbeat times out and it is evicted. If the disk heartbeat does not
    time out and the connection is idle for a long time, the cluster has
    entered a split-brain state; since fencing can't happen, we had better
    keep the connection and wait for the network to recover.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Srinivas Eeda
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
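
    From userspace the same socket option can be set as sketched below.
    This assumes that "max value" means the largest signed 32-bit count
    of milliseconds, which works out to just under 25 days; the helper
    names are hypothetical, and the option is only applied where the
    platform exposes TCP_USER_TIMEOUT.

```python
import socket

# Assumption: "max value" is the largest signed 32-bit millisecond count.
MAX_USER_TIMEOUT_MS = 0x7FFFFFFF


def timeout_in_days(timeout_ms):
    """Convert a millisecond timeout to days."""
    return timeout_ms / (1000 * 60 * 60 * 24)


def set_user_timeout(sock, timeout_ms=MAX_USER_TIMEOUT_MS):
    """Apply TCP_USER_TIMEOUT where the platform exposes it; return success."""
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    return True
```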
     
  • This patch series fixes a possible lost-message bug in ocfs2 when the
    network goes bad. The bug can cause ocfs2 to hang forever even after the
    network becomes good again.

    Messages may be lost as follows. After the tcp connection is
    established between two nodes, an idle timer is set to check its
    state periodically. If no messages are received during this time, the
    idle timer times out, shuts down the connection and tries to
    reconnect, so pending messages in the tcp queues are lost. These
    messages may be from dlm, so dlm may hang, which may in turn hang the
    whole ocfs2 cluster.

    This is very likely to happen when the network state goes bad.
    Reconnecting is useless; it will fail if the network state is still
    bad. Just waiting for the network to recover may be a good idea: it
    will not lose messages, and some node will be fenced until the cluster
    goes into split-brain state. For that case, the tcp user timeout is used
    to override the tcp retransmit timeout. It times out after 25 days; the
    user should have noticed this through the provided log and fixed the
    network. If they don't, ocfs2 falls back to the original reconnect
    behaviour.

    This patch (of 3):

    Some messages in the tcp queue may be lost if we shut down the
    connection and reconnect on idle timeout. If packets are lost and the
    reconnect succeeds, the ocfs2 cluster may hang.

    To fix this, we can leave the connection in place and make the fence
    decision on idle timeout. If the network recovers before the fence
    decision is made, the connection survives without losing any messages.

    This bug can be seen when the network state goes bad. It may cause ocfs2
    to hang forever if some packets are lost. With this fix, ocfs2 recovers
    from the hang once the network becomes good again.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Srinivas Eeda
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • If we failed to copy from the structure, writing back the flags leaks 31
    bits of kernel memory (the rest of the ir_flags field).

    In any case, if we cannot copy from/to the structure, why should we
    expect putting just the flags to work?

    Also make sure ocfs2_info_handle_freeinode() returns the right error
    code if the copy_to_user() fails.

    Fixes: ddee5cdb70e6 ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.')
    Signed-off-by: Ben Hutchings
    Cc: Joel Becker
    Acked-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Hutchings