06 Nov, 2015

11 commits

  • Merge patch-bomb from Andrew Morton:

    - inotify tweaks

    - some ocfs2 updates (many more are awaiting review)

    - various misc bits

    - kernel/watchdog.c updates

    - Some of mm. I have a huge number of MM patches this time and quite a
    lot of it is quite difficult and much will be held over to next time.

    * emailed patches from Andrew Morton : (162 commits)
    selftests: vm: add tests for lock on fault
    mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
    mm: introduce VM_LOCKONFAULT
    mm: mlock: add new mlock system call
    mm: mlock: refactor mlock, munlock, and munlockall code
    kasan: always taint kernel on report
    mm, slub, kasan: enable user tracking by default with KASAN=y
    kasan: use IS_ALIGNED in memory_is_poisoned_8()
    kasan: Fix a type conversion error
    lib: test_kasan: add some testcases
    kasan: update reference to kasan prototype repo
    kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
    kasan: various fixes in documentation
    kasan: update log messages
    kasan: accurately determine the type of the bad access
    kasan: update reported bug types for kernel memory accesses
    kasan: update reported bug types for not user nor kernel memory accesses
    mm/kasan: prevent deadlock in kasan reporting
    mm/kasan: don't use kasan shadow pointer in generic functions
    mm/kasan: MODULE_VADDR is not available on all archs
    ...

    Linus Torvalds
     
  • readahead_pages in ocfs2_duplicate_clusters_by_page is defined but not
    used, so clean it up.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • A node can mount multiple ocfs2 volumes. If the thread names are the
    same for each volume/domain, analyzing problems is inconvenient
    because we have to identify which volume/domain the messages belong to.

    Since the thread name is printed in log messages, adding the volume
    uuid or dlm name to the thread name makes problem analysis easier.
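
    A minimal user-space sketch of the naming idea, not the kernel change
    itself: build_thread_name() is a hypothetical helper and the id string
    is a made-up stand-in for a volume uuid or dlm name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<base>-<id>" so that log lines from
 * different volumes can be told apart. Returns 0 on success, -1 if the
 * result would not fit in the buffer.
 */
static int build_thread_name(char *buf, size_t size,
                             const char *base, const char *id)
{
    int n = snprintf(buf, size, "%s-%s", base, id);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

    Calling it once per mounted volume with that volume's identifier gives
    each thread a distinct name.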

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Gang He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • In ocfs2_mknod_locked, if '__ocfs2_mknod_locked' returns an error, we
    should reclaim the inode successfully claimed above; otherwise, the
    inode will never be reused. The case is described below:

    ocfs2_mknod
      ocfs2_mknod_locked
        ocfs2_claim_new_inode
          Successfully claim the inode
        __ocfs2_mknod_locked
          ocfs2_journal_access_di
          Failed because of -ENOMEM or other reasons, the inode
          lockres has not been initialized yet.

      iput(inode)
        ocfs2_evict_inode
          ocfs2_delete_inode
            ocfs2_inode_lock
              ocfs2_inode_lock_full_nested
                __ocfs2_cluster_lock
                Return -EINVAL because of the inode
                lockres has not been initialized.

            So the following operations are not performed
            ocfs2_wipe_inode
              ocfs2_remove_inode
                ocfs2_free_dinode
                  ocfs2_free_suballoc_bits

    Signed-off-by: Alex Chen
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    alex chen
     
  • There is a race between mount and deleting a node/cluster, which can
    leave o2hb_thread stuck in a malfunctioning loop.

    o2hb_thread
    {
        o2nm_depend_this_node();
        <<<<<< race window: the node may have already been deleted, and
               once the loop is entered the o2hb thread malfunctions
               because no configured nodes are found.
        while (!kthread_should_stop() &&
               !reg->hr_unclean_stop && !reg->hr_aborted_start) {
        }
    }

    So the return value of o2nm_depend_this_node() needs to be checked. If
    the node has been deleted, do not enter the loop and let mount fail.
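
    The check can be sketched in user space as follows. This is a model
    under stated assumptions, not the kernel code: struct hb_region,
    hb_thread_model() and max_iter are hypothetical, and depend_ret stands
    in for the value returned by o2nm_depend_this_node().

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the heartbeat region state named in the text. */
struct hb_region {
    bool unclean_stop;
    bool aborted_start;
    int iterations;      /* counts loop passes, for illustration only */
};

/*
 * Hypothetical model of the thread body. A non-zero depend_ret means the
 * node is already gone. max_iter replaces kthread_should_stop() so that
 * the sketch terminates.
 */
static int hb_thread_model(struct hb_region *reg, int depend_ret, int max_iter)
{
    if (depend_ret != 0)
        return depend_ret;   /* do not enter the loop at all */

    while (reg->iterations < max_iter &&
           !reg->unclean_stop && !reg->aborted_start)
        reg->iterations++;

    return 0;
}
```

    With a failing dependency check the loop body never runs, so the
    caller sees the error instead of a thread that spins without any
    configured nodes.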

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • When recovering orphans, there is no need to take the inode mutex, rw
    lock and inode lock if the entry is not a dio entry. Optimize this by
    adding a flag OCFS2_INODE_DIO_ORPHAN_ENTRY to ocfs2_inode_info to
    reduce contention.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • A dio entry is only truncated in the ORPHAN_NEED_TRUNCATE case. So do
    not include it when doing a normal orphan scan, to reduce contention.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Currently cluster allocation always tries to find a victim chain (the
    chain with the most free space), and this may lead to poor performance
    because of discontiguous allocation in some scenarios.

    Our test case uses a 4k block size, a 1M cluster size and the mount
    option localalloc=2048 (2G). Since a gd is 32256M (about 31.5G) and a
    localalloc window is only 2G, creating a 50G file will result in 2G
    from gd0, 2G from gd1, ...

    One way to improve performance is to enlarge the localalloc window size
    (max 31104M), but this makes the end user feel that about 30G is
    suddenly "missing", and localalloc currently does not support stealing,
    which means one node cannot use another node's localalloc even if it is
    in fact unused. So record the last gd used for allocation and continue
    with that gd as long as it has enough space for a localalloc window,
    which makes the allocation as contiguous as possible.
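
    The effect can be illustrated with a toy model. This is a sketch under
    assumptions, not the ocfs2 allocator: the group count NGROUPS, the
    tie-breaking rule and both function names are made up, while the group
    and window sizes come from the test case above.

```c
#include <assert.h>

#define NGROUPS   4        /* assumed number of group descriptors */
#define GROUP_MB  32256    /* gd size from the test case */
#define WINDOW_MB 2048     /* localalloc window from the test case */

/* Index of the group with the most free space (lowest index wins ties). */
static int most_free(const int free_mb[NGROUPS])
{
    int best = 0;

    for (int i = 1; i < NGROUPS; i++)
        if (free_mb[i] > free_mb[best])
            best = i;
    return best;
}

/*
 * Allocate `windows` localalloc windows and count how often two
 * consecutive windows come from different groups. With sticky != 0 the
 * last group is reused while it still has room for a whole window.
 * Returns -1 if the model runs out of space.
 */
static int count_group_switches(int windows, int sticky)
{
    int free_mb[NGROUPS];
    int last = -1, switches = 0;

    for (int i = 0; i < NGROUPS; i++)
        free_mb[i] = GROUP_MB;

    for (int w = 0; w < windows; w++) {
        int g;

        if (sticky && last >= 0 && free_mb[last] >= WINDOW_MB)
            g = last;
        else
            g = most_free(free_mb);

        if (free_mb[g] < WINDOW_MB)
            return -1;
        free_mb[g] -= WINDOW_MB;
        if (last >= 0 && g != last)
            switches++;
        last = g;
    }
    return switches;
}
```

    A 50G file needs 25 windows of 2G. In this model the most-free policy
    changes group on every window, while the sticky policy takes 15
    windows from the first group and the remaining 10 from the next one.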

    Our test results are below (measured in IOPS), using iometer running
    in a VM with a dynamic vhd virtual disk stored in ocfs2.

    IO model                 Original   After   Improved(%)
    16K60%Write100%Random    703        876     24.59%
    8K90%Write100%Random     735        827     12.59%
    4K100%Write100%Random    859        915     6.52%
    4K100%Read100%Random     2092       2600    24.30%

    Signed-off-by: Joseph Qi
    Tested-by: Norton Zhu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • A simplified test case is (this case is from Ryan):
    1) dd if=/dev/zero of=/mnt/hello bs=512 count=1 oflag=direct;
    2) truncate /mnt/hello -s 2097152
    The file 'hello' does not exist before the test. After these commands,
    the file 'hello' should be all zero, but bytes 512~4096 contain random
    data.

    Set the bh state to new when getting a new block, so that
    direct_io_worker()->dio_zero_block() will fill in the unused portion
    of the block with zeros.
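
    What zero-filling the unused portion means can be sketched in user
    space. This is an illustration only: zero_unwritten_part() is a
    hypothetical helper and does not reproduce dio_zero_block().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Zero every byte of a block that lies outside [off, off + len).
 * Returns 0 on success, -1 if the range does not fit in the block.
 */
static int zero_unwritten_part(unsigned char *block, size_t block_size,
                               size_t off, size_t len)
{
    if (off > block_size || len > block_size - off)
        return -1;

    memset(block, 0, off);                                  /* head */
    memset(block + off + len, 0, block_size - off - len);   /* tail */
    return 0;
}
```

    For the test case above, a 512-byte write at offset 0 of a fresh 4096
    byte block leaves bytes 512 to 4095 to be cleared, which is the range
    that showed random data.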

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • If ocfs2_is_overwrite failed, ocfs2_direct_IO_write may still return
    success to the caller.

    Signed-off-by: Norton.Zhu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Norton.Zhu
     
  • Pull file locking updates from Jeff Layton:
    "The largest series of changes is from Ben who offered up a set to add
    a new helper function for setting locks based on the type set in
    fl_flags. Dmitry also sent in a fix for a potential race that he
    found with KTSAN"

    * tag 'locks-v4.4-1' of git://git.samba.org/jlayton/linux:
    locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait
    Move locks API users to locks_lock_inode_wait()
    locks: introduce locks_lock_inode_wait()
    locks: Use more file_inode and fix a comment
    fs: fix data races on inode->i_flctx
    locks: change tracepoint for generic_add_lease

    Linus Torvalds
     

23 Oct, 2015

2 commits


23 Sep, 2015

1 commit

  • The order of the following three spinlocks should be:
    dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock

    But dlm_dispatch_assert_master() is called while holding
    dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls
    dlm_grab() which will take dlm_domain_lock.

    If another thread (for example, dlm_query_join_handler) has already
    taken dlm_domain_lock and tries to take dlm_ctxt->spinlock, a deadlock
    happens.

    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: "Junxiao Bi"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

12 Sep, 2015

1 commit


05 Sep, 2015

25 commits

  • Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.
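
    The escaping idea can be sketched in user space. This is not the
    kernel helper: escape_option() is a hypothetical function, and both
    the escape set passed by the caller and the octal output form are
    assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy src into dst, replacing every byte found in `esc` with a
 * backslash followed by three octal digits. Returns the output length,
 * or -1 if dst is too small.
 */
static int escape_option(char *dst, size_t size, const char *src,
                         const char *esc)
{
    size_t out = 0;

    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;

        if (strchr(esc, c)) {
            if (size - out < 5)          /* 4 bytes plus the final NUL */
                return -1;
            out += (size_t)snprintf(dst + out, size - out, "\\%03o",
                                    (unsigned int)c);
        } else {
            if (size - out < 2)          /* 1 byte plus the final NUL */
                return -1;
            dst[out++] = (char)c;
        }
    }
    if (out >= size)
        return -1;
    dst[out] = '\0';
    return (int)out;
}
```

    With an escape set that contains the newline, a value such as the
    WORK path in the example above can no longer start a new line in the
    output, so it cannot fake a second mount entry.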

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • NULL checks before kfree are redundant, so clean them up.

    Signed-off-by: Joseph Qi
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • These uses sometimes do and sometimes don't have '\n' terminations. Make
    the uses consistently use '\n' terminations and remove the newline from
    the functions.

    Miscellanea:

    o Coalesce formats
    o Realign arguments

    Signed-off-by: Joe Perches
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • While appending an extent to a file, it will call these functions:
    ocfs2_insert_extent

    -> call ocfs2_grow_tree() if there's no free rec
    -> ocfs2_add_branch add a new branch to extent tree,
    now rec[0] in the leaf of rightmost path is empty
    -> ocfs2_do_insert_extent
    -> ocfs2_rotate_tree_right
    -> ocfs2_extend_rotate_transaction
    -> jbd2_journal_restart if jbd2_journal_extend fail
    -> ocfs2_insert_path
    -> ocfs2_extend_trans
    -> jbd2_journal_restart if jbd2_journal_extend fail
    -> ocfs2_insert_at_leaf
    -> ocfs2_et_update_clusters

    jbd2_journal_restart() may be called, and it may happen that the
    buffers dirtied in ocfs2_add_branch() are committed while the buffers
    dirtied in ocfs2_insert_at_leaf() and ocfs2_et_update_clusters() are
    not. An empty rec[0] is then left in the rightmost path, which will
    cause a read-only filesystem when ocfs2_commit_truncate() is called,
    with the error message: "Inode %lu has an empty extent record".

    This is not a serious problem, so remove the rightmost path when
    calling ocfs2_commit_truncate().

    Signed-off-by: joyce.xue
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • 1. After we call ocfs2_journal_access_di() in ocfs2_write_begin(),
    jbd2_journal_restart() may also be called. In this function,
    transaction A's t_updates is decremented and a new transaction B is
    obtained. If jbd2_journal_commit_transaction() happens to be
    committing transaction A, then once t_updates==0 it will continue to
    complete the commit and unfile the buffer.

    So when jbd2_journal_dirty_metadata() is called, the handle points to
    the new transaction B and the buffer head's journal head has already
    been freed (jh->b_transaction == NULL, jh->b_next_transaction == NULL),
    so it returns EINVAL, which triggers the BUG_ON(status).

    thread 1                              jbd2
    ocfs2_write_begin                     jbd2_journal_commit_transaction
    ocfs2_write_begin_nolock
    ocfs2_start_trans
    jbd2__journal_start(t_updates+1,
                        transaction A)
    ocfs2_journal_access_di
    ocfs2_write_cluster_by_desc
    ocfs2_mark_extent_written
    ocfs2_change_extent_flag
    ocfs2_split_extent
    ocfs2_extend_rotate_transaction
    jbd2_journal_restart
    (t_updates-1,transaction B)           t_updates==0
                                          __jbd2_journal_refile_buffer
                                          (jh->b_transaction = NULL)
    ocfs2_write_end
    ocfs2_write_end_nolock
    ocfs2_journal_dirty
    jbd2_journal_dirty_metadata(bug)
    ocfs2_commit_trans

    2. In ext4, I found that jbd2_journal_get_write_access() is called by
    ext4_write_end.

    ext4_write_begin
    ext4_journal_start
    __ext4_journal_start_sb
    ext4_journal_check_start
    jbd2__journal_start

    ext4_write_end
    ext4_mark_inode_dirty
    ext4_reserve_inode_write
    ext4_journal_get_write_access
    jbd2_journal_get_write_access
    ext4_mark_iloc_dirty
    ext4_do_update_inode
    ext4_handle_dirty_metadata
    jbd2_journal_dirty_metadata

    3. So I think we should put ocfs2_journal_access_di before
    ocfs2_journal_dirty in ocfs2_write_end. It works well after my
    modification.

    Signed-off-by: vicky
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Zhangguanghui
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yangwenfang
     
  • o2hb_elapsed_msecs computes the time taken for a disk heartbeat.
    'struct timeval' variables are used to store start and end times. On
    32-bit systems, the 'tv_sec' component of 'struct timeval' will overflow
    in year 2038 and beyond.

    This patch solves the overflow with the following:

    1. Replace o2hb_elapsed_msecs using 'ktime_t' values to measure start
    and end time, and built-in function 'ktime_ms_delta' to compute the
    elapsed time. ktime_get_real() is used since the code prints out the
    wallclock time.

    2. Change the format string to print time as a single 64-bit nanoseconds
    value ("%lld") instead of seconds and microseconds. This simplifies
    the code since converting ktime_t to that format would need expensive
    computation. However, the debug log string is less readable than the
    previous format.
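
    The arithmetic behind the change can be sketched in user space.
    elapsed_ms() is a hypothetical helper working on plain 64-bit
    nanosecond counts; it only mirrors what a millisecond delta computes
    and is not the kernel's ktime implementation.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000LL

/* Elapsed milliseconds between two 64-bit nanosecond timestamps. */
static int64_t elapsed_ms(int64_t start_ns, int64_t end_ns)
{
    return (end_ns - start_ns) / NSEC_PER_MSEC;
}
```

    A start time whose seconds value no longer fits in a signed 32-bit
    field is still representable here, because the whole timestamp is a
    single 64-bit count.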

    Signed-off-by: Tina Ruchandani
    Suggested-by: Arnd Bergmann
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tina Ruchandani
     
  • There is a race between a crashed dio and rm, which leads to a
    read-only filesystem because OCFS2_VALID_FL is not set.

    N1                              N2
    ------------------------------------------------------------------------
    dd with direct flag
                                    rm file
    crashed with a dio entry left
    in orphan dir
                                    clear OCFS2_VALID_FL in
                                    ocfs2_remove_inode
                                    recover N1 and read the corrupted inode,
                                    and set filesystem read-only

    So we skip the inode deletion this time and wait for the dio entry to
    be recovered first.

    Signed-off-by: Joseph Qi
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • The following case will lead to a lockres being freed while it is
    still in use.

    cat /sys/kernel/debug/o2dlm/locking_state    dlm_thread
    lockres_seq_start
    -> lock dlm->track_lock
    -> get resA
                                                 resA->refs decrease to 0,
                                                 call dlm_lockres_release,
                                                 and wait for "cat" unlock.
    Although resA->refs is already set to 0,
    increase resA->refs, and then unlock
                                                 lock dlm->track_lock
                                                 -> list_del_init()
                                                 -> unlock
                                                 -> free resA

    In such a race, an invalid address access may occur. So we should
    remove res->tracking from the list before resA->refs decreases to 0.

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Joel Becker
    Signed-off-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yiwen Jiang
     
  • This bug in mainline code was pointed out by Mark Fasheh. When
    ocfs2_iop_set_acl() and ocfs2_iop_get_acl() are entered from the VFS
    layer, the inode lock is not held. This seems to be a regression from
    older kernels. The patch fixes that.

    Orabug: 20189959
    Signed-off-by: Tariq Saeed
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tariq Saeed
     
  • PID: 614 TASK: ffff882a739da580 CPU: 3 COMMAND: "ocfs2dc"
    #0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d
    #1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5
    #2 [ffff882ecc375af0] oops_end at ffffffff815091d8
    #3 [ffff882ecc375b20] die at ffffffff8101868b
    #4 [ffff882ecc375b50] do_trap at ffffffff81508bb0
    #5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5
    #6 [ffff882ecc375c40] invalid_op at ffffffff815116fb
    [exception RIP: ocfs2_ci_checkpointed+208]
    RIP: ffffffffa0a7e940 RSP: ffff882ecc375cf0 RFLAGS: 00010002
    RAX: 0000000000000001 RBX: 000000000000654b RCX: ffff8812dc83f1f8
    RDX: 00000000000017d9 RSI: ffff8812dc83f1f8 RDI: ffffffffa0b2c318
    RBP: ffff882ecc375d20 R8: ffff882ef6ecfa60 R9: ffff88301f272200
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff
    R13: ffff8812dc83f4f0 R14: 0000000000000000 R15: ffff8812dc83f1f8
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2]
    #8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2]
    #9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2]
    #10 [ffff882ecc375e18] ocfs2_downconvert_thread_do_work at ffffffffa0a85445 [ocfs2]
    #11 [ffff882ecc375e68] ocfs2_downconvert_thread at ffffffffa0a854de [ocfs2]
    #12 [ffff882ecc375ee8] kthread at ffffffff81090da7
    #13 [ffff882ecc375f48] kernel_thread_helper at ffffffff81511884
    assert is tripped because the tran is not checkpointed and the lock level is PR.

    Some time ago, a chmod command had been executed. As a result, the following
    call chain left the inode cluster lock in PR state, later on causing the assert.
    system_call_fastpath
    -> my_chmod
    -> sys_chmod
    -> sys_fchmodat
    -> notify_change
    -> ocfs2_setattr
    -> posix_acl_chmod
    -> ocfs2_iop_set_acl
    -> ocfs2_set_acl
    -> ocfs2_acl_set_mode
    Here is how.
    1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
    1120 {
    1247 ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
    ..
    1258 if (!status && attr->ia_valid & ATTR_MODE) {
    1259 status = posix_acl_chmod(inode, inode->i_mode);

    519 posix_acl_chmod(struct inode *inode, umode_t mode)
    520 {
    ..
    539 ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);

    287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ...
    288 {
    289 return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL);

    224 int ocfs2_set_acl(handle_t *handle,
    225 struct inode *inode, ...
    231 {
    ..
    252 ret = ocfs2_acl_set_mode(inode, di_bh,
    253 handle, mode);

    168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
    170 {
    183 if (handle == NULL) {
    >>> BUG: inode lock not held in ex at this point <<<
    184 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
    185 OCFS2_INODE_UPDATE_CREDITS);

    At ocfs2_setattr.#1247 we unlock and at #1259 call posix_acl_chmod. When we
    reach ocfs2_acl_set_mode.#181 and do trans, the inode cluster lock is not
    held in EX mode (it should be). How could this have happened?

    We are the lock master, were holding lock EX and have released it in
    ocfs2_setattr.#1247. Note that there are no holders of this lock at
    this point. Another node needs the lock in PR, and we downconvert from
    EX to PR. So the inode lock is PR when we do the trans in
    ocfs2_acl_set_mode.#184. The trans stays in core (not flushed to disk).
    Now another node wants the lock in EX, and the downconvert thread gets
    kicked (the one that tripped the assert above), finds an unflushed trans
    but the lock is not EX (it is PR). If the lock was at EX, it would have
    flushed the trans ocfs2_ci_checkpointed -> ocfs2_start_checkpoint before
    downconverting (to NULL) for the request.

    ocfs2_setattr must not drop inode lock ex in this code path. If it
    does, and takes it again before the trans, say in ocfs2_set_acl, another
    cluster node can get in between, execute another setattr, overwriting
    the one in progress on this node, resulting in a mode acl size combo
    that is a mix of the two.

    Orabug: 20189959
    Signed-off-by: Tariq Saeed
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tariq Saeed
     
  • Currently error handling in dlm_request_join is a little obscure, so
    optimize it to promote readability.

    If packet.code is invalid, reset it to JOIN_DISALLOW to keep it
    meaningful. It only influences the log printing.

    Signed-off-by: Norton.Zhu
    Cc: Srinivas Eeda
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Norton.Zhu
     
  • When running dirop_fileop_racer we found a case where an inode
    cannot be removed.

    Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create
    two dirs /race/1/ and /race/2/ in the filesystem.

    Node A                                  Node B
    rm -r /race/2/
                                            mv /race/1/ /race/2/
    call ocfs2_unlink(), get
    the EX mode of /race/2/
                                            wait for B unlock /race/2/
    decrease i_nlink of /race/2/ to 0,
    and add inode of /race/2/ into
    orphan dir, unlock /race/2/
                                            got EX mode of /race/2/. because
                                            /race/1/ is dir, so inc i_nlink
                                            of /race/2/ and update into disk,
                                            unlock /race/2/
    because i_nlink of /race/2/
    is not zero, this inode will
    always remain in orphan dir

    This patch fixes this case by testing whether i_nlink of the new dir is zero.

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Cc: Xue jiufei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yiwen Jiang
     
  • In ocfs2, ip_alloc_sem is used to protect allocation changes on the
    node. In direct IO, we take ip_alloc_sem to protect data consistency
    in the race between direct-io and ocfs2_truncate_file (buffered io
    already uses ip_alloc_sem). Although the inode->i_mutex lock is used to
    avoid concurrency in the above situation, I think ip_alloc_sem is still
    needed because protecting allocation changes is significant.

    Other filesystems like ext4 also use a rw_semaphore to protect data
    consistency in the get_block-vs-truncate race by other means, so
    ip_alloc_sem is needed in ocfs2 direct io.

    Signed-off-by: Weiwei Wang
    Signed-off-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WeiWei Wang
     
  • In case a validation fails, clear the rest of the buffers and return the
    error to the calling function.

    This also facilitates bubbling up the error originating from ocfs2_error
    to calling functions.

    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • Caveat: This may return -EROFS for a read case, which seems wrong. This
    is happening even without this patch series though. Should we convert
    EROFS to EIO?

    Signed-off-by: Goldwyn Rodrigues
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • OCFS2 is often used in high-availability systems. However, ocfs2
    converts the filesystem to read-only at the drop of a hat. This may
    not be necessary, since turning the filesystem read-only would affect
    other running processes as well, decreasing availability.

    This attempt is to add errors=continue, which would return the EIO to
    the calling process and terminate further processing so that the
    filesystem is not corrupted further. However, the filesystem is not
    converted to read-only.

    As a future plan, I intend to create a small utility or extend
    fsck.ocfs2 to fix small errors such as in the inode. The input to the
    utility such as the inode can come from the kernel logs so we don't have
    to schedule a downtime for fixing small-enough errors.

    The patch changes the ocfs2_error to return an error. The error
    returned depends on the mount option set. If none is set, the default
    is to turn the filesystem read-only.
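
    The described behaviour can be modelled in user space as below. This
    is a sketch under assumptions, not the ocfs2 code: enum err_policy,
    struct fs_model and fs_report_error() are hypothetical names, and the
    -EROFS value for the default case is taken from the caveat noted
    elsewhere in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum err_policy {
    ERRORS_REMOUNT_RO,   /* default: turn the filesystem read-only */
    ERRORS_CONTINUE      /* the proposed errors=continue option */
};

struct fs_model {
    enum err_policy policy;
    bool read_only;
};

/* Hypothetical error hook that returns a code instead of void. */
static int fs_report_error(struct fs_model *fs)
{
    if (fs->policy == ERRORS_CONTINUE)
        return -EIO;         /* caller stops; filesystem stays writable */

    fs->read_only = true;
    return -EROFS;
}
```

    Callers propagate the returned value upward, so the calling process
    sees the failure while other processes are affected only under the
    default policy.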

    Perhaps errors=continue is not the best option name. Historically it is
    used for making an attempt to progress in the current process itself.
    Should we call it errors=eio? or errors=killproc? Suggestions/Comments
    welcome.

    Sources are available at:
    https://github.com/goldwynr/linux/tree/error-cont

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • Disk inode deletion may be heavily delayed when one node unlinks a file
    after the same dentry has been freed on another node (say N1) because
    of memory shrinking while the inode is left in memory. This inode can
    only be freed when N1 does the orphan scan work.

    However, N1 may skip the orphan scan several times because other nodes
    may do the work earlier. In our tests, it may take 1 hour on a 4-node
    cluster, which hurts the user experience. So we think the inode should
    be freed after the data is flushed to disk when i_count becomes zero,
    to avoid such circumstances.

    Signed-off-by: Joyce.xue
    Cc: Joel Becker
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • The trusted extended attributes are only visible to processes which
    have the CAP_SYS_ADMIN capability, but the check is missing in the
    ocfs2 xattr_handler trusted list. The check is important because
    trusted attributes are used to implement userspace mechanisms that
    ordinary processes should not have access to.

    Signed-off-by: Sanidhya Kashyap
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Taesoo kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sanidhya Kashyap
     
  • In ocfs2_rename, a failure of ocfs2_delete_entry(old) leads to an inode
    with two entries (old and new). Thus, the filesystem will be
    inconsistent.

    The case is described below:

    ocfs2_rename
    -> ocfs2_start_trans
    -> ocfs2_add_entry(new)
    -> ocfs2_delete_entry(old)
    -> __ocfs2_journal_access *failed* because of -ENOMEM
    -> ocfs2_commit_trans

    So the filesystem should be set to read-only at that point.

    Signed-off-by: Yiwen Jiang
    Cc: Joseph Qi
    Cc: Joel Becker
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • Use list_for_each_entry instead of list_for_each to simplify code.
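
    The difference between the two iteration styles can be shown with a
    user-space imitation of the list macros. This is a simplified sketch:
    the macros below only approximate the kernel's <linux/list.h> (they
    rely on the GNU __typeof__ extension), and struct item and the two
    sum functions are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Open-coded style: walk list_head pointers, convert each one by hand. */
#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

/* Entry style: the macro performs the conversion itself. */
#define list_for_each_entry(pos, head, member)                          \
    for (pos = container_of((head)->next, __typeof__(*pos), member);    \
         &pos->member != (head);                                        \
         pos = container_of(pos->member.next, __typeof__(*pos), member))

struct item { int value; struct list_head link; };

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

static int sum_open_coded(struct list_head *head)
{
    struct list_head *p;
    int sum = 0;

    list_for_each(p, head) {
        struct item *it = container_of(p, struct item, link);
        sum += it->value;
    }
    return sum;
}

static int sum_entries(struct list_head *head)
{
    struct item *it;
    int sum = 0;

    list_for_each_entry(it, head, link)
        sum += it->value;
    return sum;
}
```

    Both functions visit the same elements; the entry style drops the
    temporary list_head pointer and the explicit conversion.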

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • The last goto statement is unneeded, so remove it.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • In dlm_register_domain_handlers, if o2hb_register_callback fails, it
    will call dlm_unregister_domain_handlers to unregister. This will
    trigger the BUG_ON in o2hb_unregister_callback because hc_magic is 0.
    So we should call o2hb_setup_callback to initialize hc first.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • status is already initialized and it will only be 0 or negative in the
    code flow. So remove the unneeded assignment after the label 'local'.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • The unlocking order in ocfs2_unlink and ocfs2_rename does not match
    the corresponding locking order. Although this won't cause issues,
    adjust the code so that it looks more reasonable.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Since commit 86b9c6f3f891 ("ocfs2: remove filesize checks for sync I/O
    journal commit") removed the filesize checks for sync I/O journal
    commit, the variables old_size and old_clusters are no longer used. So
    clean them up.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi