18 Jan, 2020

1 commit

  • [ Upstream commit 397eac17f86f404f5ba31d8c3e39ec3124b39fd3 ]

    If journal is dirty when mount, it will be replayed but jbd2 sb log tail
    cannot be updated to mark a new start because journal->j_flag has
    already been set with JBD2_ABORT first in journal_init_common.

    When a new transaction is committed, it will be recored in block 1
    first(journal->j_tail is set to 1 in journal_reset). If emergency
    restart happens again before journal super block is updated
    unfortunately, the new recorded trans will not be replayed in the next
    mount.

    The following steps describe this procedure in detail.
    1. mount and touch some files
    2. these transactions are committed to journal area but not checkpointed
    3. emergency restart
    4. mount again and its journals are replayed
    5. journal super block's first s_start is 1, but its s_seq is not updated
    6. touch a new file and its trans is committed but not checkpointed
    7. emergency restart again
    8. mount and journal is dirty, but trans committed in 6 will not be
    replayed.

    This exception happens easily when this lun is used by only one node.
    If it is used by multi-nodes, other node will replay its journal and its
    journal super block will be updated after recovery like what this patch
    does.

    ocfs2_recover_node->ocfs2_replay_journal.

    The following jbd2 journal can be generated by touching a new file after
    journal is replayed, and seq 15 is the first valid commit, but first seq
    is 13 in journal super block.

    logdump:
    Block 0: Journal Superblock
    Seq: 0 Type: 4 (JBD2_SUPERBLOCK_V2)
    Blocksize: 4096 Total Blocks: 32768 First Block: 1
    First Commit ID: 13 Start Log Blknum: 1
    Error: 0
    Feature Compat: 0
    Feature Incompat: 2 block64
    Feature RO compat: 0
    Journal UUID: 4ED3822C54294467A4F8E87D2BA4BC36
    FS Share Cnt: 1 Dynamic Superblk Blknum: 0
    Per Txn Block Limit Journal: 0 Data: 0

    Block 1: Journal Commit Block
    Seq: 14 Type: 2 (JBD2_COMMIT_BLOCK)

    Block 2: Journal Descriptor
    Seq: 15 Type: 1 (JBD2_DESCRIPTOR_BLOCK)
    No. Blocknum Flags
    0. 587 none
    UUID: 00000000000000000000000000000000
    1. 8257792 JBD2_FLAG_SAME_UUID
    2. 619 JBD2_FLAG_SAME_UUID
    3. 24772864 JBD2_FLAG_SAME_UUID
    4. 8257802 JBD2_FLAG_SAME_UUID
    5. 513 JBD2_FLAG_SAME_UUID JBD2_FLAG_LAST_TAG
    ...
    Block 7: Inode
    Inode: 8257802 Mode: 0640 Generation: 57157641 (0x3682809)
    FS Generation: 2839773110 (0xa9437fb6)
    CRC32: 00000000 ECC: 0000
    Type: Regular Attr: 0x0 Flags: Valid
    Dynamic Features: (0x1) InlineData
    User: 0 (root) Group: 0 (root) Size: 7
    Links: 1 Clusters: 0
    ctime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019
    atime: 0x5de5d870 0x113181a1 -- Tue Dec 3 11:37:20.288457121 2019
    mtime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019
    dtime: 0x0 -- Thu Jan 1 08:00:00 1970
    ...
    Block 9: Journal Commit Block
    Seq: 15 Type: 2 (JBD2_COMMIT_BLOCK)

    The following is journal recovery log when recovering the upper jbd2
    journal when mount again.

    syslog:
    ocfs2: File system on device (252,1) was not unmounted cleanly, recovering it.
    fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 0
    fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 1
    fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 2
    fs/jbd2/recovery.c:(jbd2_journal_recover, 278): JBD2: recovery, exit status 0, recovered transactions 13 to 13

    Due to first commit seq 13 recorded in journal super is not consistent
    with the value recorded in block 1(seq is 14), journal recovery will be
    terminated before seq 15 even though it is an unbroken commit, inode
    8257802 is a new file and it will be lost.

    Link: http://lkml.kernel.org/r/20191217020140.2197-1-li.kai4@h3c.com
    Signed-off-by: Kai Li
    Reviewed-by: Joseph Qi
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Kai Li
     

19 Oct, 2019

1 commit

  • mount.ocfs2 failed when reading ocfs2 filesystem superblock encounters
    an error. ocfs2_initialize_super() returns before allocating ocfs2_wq.
    ocfs2_dismount_volume() triggers the following panic.

    Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted.
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30
    Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30
    ------------[ cut here ]------------
    Oops: 0002 [#1] SMP NOPTI
    CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G E
    4.14.148-200.ckv.x86_64 #1
    Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018
    task: ffff967af0520000 task.stack: ffffa5f05484000
    RIP: 0010:mutex_lock+0x19/0x20
    Call Trace:
    flush_workqueue+0x81/0x460
    ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2]
    ocfs2_dismount_volume+0x84/0x400 [ocfs2]
    ocfs2_fill_super+0xa4/0x1270 [ocfs2]
    ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2]
    mount_bdev+0x17f/0x1c0
    mount_fs+0x3a/0x160

    Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.com
    Signed-off-by: Yi Li
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yi Li
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version this program is distributed in the
    hope that it will be useful but without any warranty without even
    the implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details you
    should have received a copy of the gnu general public license along
    with this program if not write to the free software foundation inc
    59 temple place suite 330 boston ma 021110 1307 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 84 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Allison Randal
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190524100844.756442981@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

29 Dec, 2018

1 commit

  • Dirty flag of the journal should be cleared at the last stage of umount,
    if do it before jbd2_journal_destroy(), then some metadata in uncommitted
    transaction could be lost due to io error, but as dirty flag of journal
    was already cleared, we can't find that until run a full fsck. This may
    cause system panic or other corruption.

    Link: http://lkml.kernel.org/r/20181121020023.3034-3-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Jun Piao
    Cc: Changwei Ge
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     

04 Nov, 2018

1 commit

  • During one dead node's recovery by other node, quota recovery work will
    be queued. We should avoid calling quota when it is not supported, so
    check the quota flags.

    Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071AC9FB@H3CMLB12-EX.srv.huawei-3com.com
    Signed-off-by: guozhonghua
    Reviewed-by: Jan Kara
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guozhonghua
     

13 Jun, 2018

1 commit

  • The kzalloc() function has a 2-factor argument form, kcalloc(). This
    patch replaces cases of:

    kzalloc(a * b, gfp)

    with:
    kcalloc(a * b, gfp)

    as well as handling cases of:

    kzalloc(a * b * c, gfp)

    with:

    kzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kzalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kzalloc
    + kcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kzalloc(sizeof(THING) * C2, ...)
    |
    kzalloc(sizeof(TYPE) * C2, ...)
    |
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(C1 * C2, ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

01 Feb, 2018

1 commit

  • We should not reuse the dirty bh in jbd2 directly due to the following
    situation:

    1. When removing extent rec, we will dirty the bhs of extent rec and
    truncate log at the same time, and hand them over to jbd2.

    2. The bhs are submitted to jbd2 area successfully.

    3. The write-back thread of device help flush the bhs to disk but
    encounter write error due to abnormal storage link.

    4. After a while the storage link become normal. Truncate log flush
    worker triggered by the next space reclaiming found the dirty bh of
    truncate log and clear its 'BH_Write_EIO' and then set it uptodate in
    __ocfs2_journal_access():

    ocfs2_truncate_log_worker
    ocfs2_flush_truncate_log
    __ocfs2_flush_truncate_log
    ocfs2_replay_truncate_records
    ocfs2_journal_access_di
    __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.

    5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
    extent rec is still in error state, and unfortunately nobody will
    take care of it.

    6. At last the space of extent rec was not reduced, but truncate log
    flush worker have given it back to globalalloc. That will cause
    duplicate cluster problem which could be identified by fsck.ocfs2.

    Sadly we can hardly revert this but set fs read-only in case of ruining
    atomicity and consistency of space reclaim.

    Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com
    Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
    Signed-off-by: Jun Piao
    Reviewed-by: Yiwen Jiang
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    piaojun
     

07 Sep, 2017

1 commit

  • clean up some unused functions and parameters.

    Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Alex Chen
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jun Piao
     

13 Dec, 2016

1 commit

  • struct timespec is not y2038 safe. Use time64_t which is y2038 safe to
    represent orphan scan times. time64_t is sufficient here as only the
    seconds delta times are relevant.

    Also use appropriate time functions that return time in time64_t format.
    Time functions now return monotonic time instead of real time as only
    delta scan times are relevant and these values are not persistent across
    reboots.

    The format string for the debug print is still using long as this is
    only the time elapsed since the last scan and long is sufficient to
    represent this value.

    Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     

27 Jul, 2016

1 commit

  • Journal replay will be run when performing recovery for a dead node. To
    avoid the stale cache impact, all blocks of dead node's journal inode
    were reloaded from disk. This hurts the performance. Check whether one
    block is cached before reloading it can improve performance a lot. In
    my test env, the time doing recovery was improved from 120s to 1s.

    [akpm@linux-foundation.org: clean up the for loop p_blkno handling]
    Link: http://lkml.kernel.org/r/1466155682-24656-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: "Gang He"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     

26 Mar, 2016

1 commit

  • This patch fixes a deadlock, as follows:

    Node 1 Node 2 Node 3
    1)volume a and b are only mount vol a only mount vol b
    mounted

    2) start to mount b start to mount a

    3) check hb of Node 3 check hb of Node 2
    in vol a, qs_holds++ in vol b, qs_holds++

    4) -------------------- all nodes' network down --------------------

    5) progress of mount b the same situation as
    failed, and then call Node 2
    ocfs2_dismount_volume.
    but the process is hung,
    since there is a work
    in ocfs2_wq cannot beo
    completed. This work is
    about vol a, because
    ocfs2_wq is global wq.
    BTW, this work which is
    scheduled in ocfs2_wq is
    ocfs2_orphan_scan_work,
    and the context in this work
    needs to take inode lock
    of orphan_dir, because
    lockres owner are Node 1 and
    all nodes' nework has been down
    at the same time, so it can't
    get the inode lock.

    6) Why can't this node be fenced
    when network disconnected?
    Because the process of
    mount is hung what caused qs_holds
    is not equal 0.

    Because all works in the ocfs2_wq are relative to the super block.

    The solution is to change the ocfs2_wq from global to local. In other
    words, move it into struct ocfs2_super.

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Xue jiufei
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

15 Jan, 2016

1 commit


06 Nov, 2015

3 commits

  • A node can mount multiple ocfs2 volumes. And if thread names are same for
    each volume/domain, it will bring inconvenience when analyzing problems
    because we have to identify which volume/domain the messages belong to.

    Since thread name will be printed to messages, so add volume uuid or dlm
    name to thread name can benefit problem analysis.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Gang He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • We have no need to take inode mutex, rw and inode lock if it is not dio
    entry when recover orphans. Optimize it by adding a flag
    OCFS2_INODE_DIO_ORPHAN_ENTRY to ocfs2_inode_info to reduce contention.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • dio entry will only do truncate in case of ORPHAN_NEED_TRUNCATE. So do
    not include it when doing normal orphan scan to reduce contention.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

05 Sep, 2015

4 commits

  • These uses sometimes do and sometimes don't have '\n' terminations. Make
    the uses consistently use '\n' terminations and remove the newline from
    the functions.

    Miscellanea:

    o Coalesce formats
    o Realign arguments

    Signed-off-by: Joe Perches
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • There is a race case between crashed dio and rm, which will lead to
    OCFS2_VALID_FL not set read-only.

    N1 N2
    ------------------------------------------------------------------------
    dd with direct flag
    rm file
    crashed with an dio entry left
    in orphan dir
    clear OCFS2_VALID_FL in
    ocfs2_remove_inode
    recover N1 and read the corrupted inode,
    and set filesystem read-only

    So we skip the inode deletion this time and wait for dio entry recovered
    first.

    Signed-off-by: Joseph Qi
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • When storage network is unstable, it may trigger the BUG in
    __ocfs2_journal_access because of buffer not uptodate. We can retry the
    write in this case or return error instead of BUG.

    Signed-off-by: Joseph Qi
    Reported-by: Zhangguanghui
    Tested-by: Zhangguanghui
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • During direct io the inode will be added to orphan first and then
    deleted from orphan. There is a race window that the orphan entry will
    be deleted twice and thus trigger the BUG when validating
    OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.

    ocfs2_direct_IO_write
    ...
    ocfs2_add_inode_to_orphan
    >>>>>>>> race window.
    1) another node may rm the file and then down, this node
    take care of orphan recovery and clear flag
    OCFS2_DIO_ORPHANED_FL.
    2) since rw lock is unlocked, it may race with another
    orphan recovery and append dio.
    ocfs2_del_inode_from_orphan

    So take inode mutex lock when recovering orphans and make rw unlock at the
    end of aio write in case of append dio.

    Signed-off-by: Joseph Qi
    Reported-by: Yiwen Jiang
    Cc: Weiwei Wang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

25 Jun, 2015

4 commits

  • Some functions are only used locally, so mark them as static.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • ocfs2_abort_trigger() use bh->b_assoc_map to get sb. But there's no
    function to set bh->b_assoc_map in ocfs2, it will trigger NULL pointer
    dereference while calling this function. We can get sb from
    bh->b_bdev->bd_super instead of b_assoc_map.

    [akpm@linux-foundation.org: update comment, per Joseph]
    Signed-off-by: joyce.xue
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • jbd2_journal_dirty_metadata may fail. Currently it cannot take care of
    non zero return value and just BUG in ocfs2_journal_dirty. This patch is
    aborting the handle and journal instead of BUG.

    Signed-off-by: Joseph Qi
    Cc: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Once dio crashed it will leave an entry in orphan dir. And orphan scan
    will take care of the clean up. There is a tiny race case that the same
    entry will be truncated twice and then trigger the BUG in
    ocfs2_del_inode_from_orphan.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

17 Feb, 2015

2 commits

  • If one node has crashed with orphan entry leftover, another node which do
    append O_DIRECT write to the same file will override the
    i_dio_orphaned_slot. Then the old entry won't be cleaned forever. If
    this case happens, we let it wait for orphan recovery first.

    Signed-off-by: Joseph Qi
    Cc: Weiwei Wang
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Xuejiufei
    Cc: alex chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Define two orphan recovery types, which indicates if need truncate file or
    not.

    Signed-off-by: Joseph Qi
    Cc: Weiwei Wang
    Cc: Junxiao Bi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Xuejiufei
    Cc: alex chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

11 Feb, 2015

1 commit


01 Nov, 2014

1 commit


05 Jun, 2014

1 commit

  • Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
    ocfs2_commit_thread. Then it will get into a loop with mass logs. This
    will meaninglessly consume a larger number of resource and may lead to
    the system hanging. So limit printk in this case.

    [akpm@linux-foundation.org: document the msleep]
    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

04 Apr, 2014

1 commit


12 Sep, 2013

2 commits

  • Though ocfs2 uses inode->i_mutex to protect i_size, there are both
    i_size_read/write() and direct accesses. Clean up all direct access to
    eliminate confusion.

    Signed-off-by: Junxiao Bi
    Cc: Jie Liu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • The issue scenario is as following:

    When fallocating a very large disk space for a small file,
    __ocfs2_extend_allocation attempts to get a very large transaction. For
    some journal sizes, there may be not enough room for this transaction,
    and the fallocate will fail.

    The patch below extends & restarts the transaction as necessary while
    allocating space, and should work with even the smallest journal. This
    patch refers ext4 resize.

    Test:
    # mkfs.ocfs2 -b 4K -C 32K -T datafiles /dev/sdc
    ...(jounral size is 32M)
    # mount.ocfs2 /dev/sdc /mnt/ocfs2/
    # touch /mnt/ocfs2/1.log
    # fallocate -o 0 -l 400G /mnt/ocfs2/1.log
    fallocate: /mnt/ocfs2/1.log: fallocate failed: Cannot allocate memory
    # tail -f /var/log/messages
    [ 7372.278591] JBD: fallocate wants too many credits (2051 > 2048)
    [ 7372.278597] (fallocate,6438,0):__ocfs2_extend_allocation:709 ERROR: status = -12
    [ 7372.278603] (fallocate,6438,0):ocfs2_allocate_unwritten_extents:1504 ERROR: status = -12
    [ 7372.278607] (fallocate,6438,0):__ocfs2_change_file_space:1955 ERROR: status = -12
    ^C
    With this patch, the test works well.

    Signed-off-by: Younger Liu
    Cc: Jie Liu
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Younger Liu
     

29 Jun, 2013

1 commit


22 Feb, 2013

1 commit

  • smatch analysis indicates a number of redundant NULL checks before
    calling kfree(), eg:

    fs/ocfs2/alloc.c:6138 ocfs2_begin_truncate_log_recovery() info:
    redundant null check on *tl_copy calling kfree()

    fs/ocfs2/alloc.c:6755 ocfs2_zero_range_for_truncate() info:
    redundant null check on pages calling kfree()

    etc....

    [akpm@linux-foundation.org: revert dubious change in ocfs2_begin_truncate_log_recovery()]
    Signed-off-by: Tim Gardner
    Cc: Mark Fasheh
    Acked-by: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Gardner
     

31 Jul, 2012

1 commit

  • Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new freeze
    protection. We also protect several ioctl entry points which were missing the
    protection. Finally, we add freeze protection to the journaling mechanism so
    that iput() of unlinked inode cannot modify a frozen filesystem.

    CC: Mark Fasheh
    CC: Joel Becker
    CC: ocfs2-devel@oss.oracle.com
    Acked-by: Joel Becker
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

25 Jul, 2011

2 commits


14 May, 2011

1 commit


31 Mar, 2011

1 commit


24 Feb, 2011

1 commit