25 Sep, 2019

1 commit

  • There is no need to check return value of debugfs_create functions, but
    the last sweep through ocfs missed a number of places where this was
    happening. There is also no need to save the individual dentries for the
    debugfs files, as everything can just be removed at once when the
    directory is removed.

    By getting rid of the file dentries for the debugfs entries, a bit of
    local memory can be saved as well.

    [colin.king@canonical.com: ensure ret is set to zero before returning]
    Link: http://lkml.kernel.org/r/20190807121929.28918-1-colin.king@canonical.com
    Link: http://lkml.kernel.org/r/20190731132119.GA12603@kroah.com
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Colin Ian King
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Jia Guo
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     

13 Jul, 2019

3 commits

  • The ocfs2 file system uses the locking_state file under debugfs to dump
    each ocfs2 file system's dlm lock resources, but users have encountered
    hang (deadlock) problems in ocfs2. I'd like to add the first lock wait
    time to the locking_state file, which helps upper-level scripts detect
    these deadlocks by comparing the first lock wait time with the current
    time.
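
    The comparison described above could look roughly like the sketch
    below. The function name, the microsecond unit, the threshold parameter
    and the convention that a zero timestamp means "no waiter" are all
    assumptions made for illustration; the commit message does not specify
    them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a lock resource looks stuck.
 * first_wait_us: when the first pending request started waiting
 *                (assumed unit: microseconds, 0 = nobody is waiting).
 * now_us:        current time on the same clock, same unit.
 * threshold_us:  how long a wait is tolerated before it is reported.
 */
static bool lock_wait_looks_stuck(uint64_t first_wait_us, uint64_t now_us,
                                  uint64_t threshold_us)
{
    assert(threshold_us > 0);

    if (first_wait_us == 0 || now_us < first_wait_us)
        return false;   /* no waiter, or timestamps not comparable */

    return now_us - first_wait_us > threshold_us;
}
```

    A monitoring script would call such a check for every record it parses
    and report the lock resources for which it returns true.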

    Link: http://lkml.kernel.org/r/20190611015414.27754-3-ghe@suse.com
    Signed-off-by: Gang He
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     
  • Add a locking_filter debugfs file, which is used to filter the lock
    resources dumped from the locking_state debugfs file. The d_filter_secs
    field controls the filter: the default value (0) filters nothing;
    otherwise, only lock resources active in the last N seconds are dumped.
    This enhancement avoids dumping lots of old records. The d_filter_secs
    value can be changed via the locking_filter file.
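
    A minimal sketch of the filter rule described above, not the kernel
    code itself. The function name and the use of plain second counters are
    assumptions; only the meaning of d_filter_secs (0 filters nothing, N
    keeps the last N seconds) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical filter mirroring the described d_filter_secs behaviour:
 * 0 disables filtering; N > 0 keeps only lock resources that were
 * active within the last N seconds.
 */
static bool should_dump_lockres(int64_t now_secs, int64_t last_used_secs,
                                uint32_t filter_secs)
{
    assert(now_secs >= last_used_secs);

    if (filter_secs == 0)
        return true;    /* default: dump every lock resource */

    return now_secs - last_used_secs <= (int64_t)filter_secs;
}
```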

    [akpm@linux-foundation.org: fix undefined reference to `__udivdi3']
    Link: http://lkml.kernel.org/r/20190611015414.27754-2-ghe@suse.com
    Signed-off-by: Gang He
    Reviewed-by: Joseph Qi
    Acked-by: Randy Dunlap [build-tested]
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     
  • The ocfs2 file system uses the locking_state file under debugfs to dump
    each ocfs2 file system's dlm lock resources, but the number of dlm lock
    resources in memory keeps growing as the user touches more files. This
    makes it difficult for upper-level scripts to analyze the dlm lock
    resource records in the locking_state file, because many of them belong
    to files that were accessed long ago and are no longer active.

    So I'd like to add the last pr/ex unlock times to the locking_state file
    for each dlm lock resource record; upper-level scripts can then use the
    last unlock time to filter out inactive dlm lock resource records.

    Link: http://lkml.kernel.org/r/20190611015414.27754-1-ghe@suse.com
    Signed-off-by: Gang He
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version this program is distributed in the
    hope that it will be useful but without any warranty without even
    the implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details you
    should have received a copy of the gnu general public license along
    with this program if not write to the free software foundation inc
    59 temple place suite 330 boston ma 021110 1307 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 84 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Allison Randal
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190524100844.756442981@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Mar, 2019

1 commit

  • A user reported this problem: the upper application's IO timed out
    while fstrim was running on this ocfs2 partition. The application
    monitoring resource agent concluded that the application was not
    working, and the node was then fenced by the cluster brain (e.g.
    pacemaker).

    The root cause is that the fstrim thread always holds the main_bm
    meta-file related locks until all the cluster groups are trimmed. This
    patch makes the fstrim thread release the main_bm meta-file related
    locks after each cluster group is trimmed, which gives the current
    application IO a chance to claim clusters from the main_bm meta-file.

    Link: http://lkml.kernel.org/r/20190111090014.31645-1-ghe@suse.com
    Signed-off-by: Gang He
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     

06 Apr, 2018

1 commit

  • Use the embedded kobject mechanism for the online file check feature.
    This avoids using a global list to save/search per-device online file
    check related data, reduces the number of code lines and makes the code
    logic clearer. The changed code is based on Goldwyn Rodrigues's patches
    and the ext4 fs code.

    Link: http://lkml.kernel.org/r/1495611866-27360-4-git-send-email-ghe@suse.com
    Signed-off-by: Gang He
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     

01 Feb, 2018

1 commit

  • Introduce a new dlm lock resource, which will be used to communicate
    during fstrimming of an ocfs2 device from cluster nodes.

    Link: http://lkml.kernel.org/r/1513228484-2084-1-git-send-email-ghe@suse.com
    Signed-off-by: Gang He
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     

07 Sep, 2017

1 commit

  • Clean up some unused functions and parameters.

    Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Alex Chen
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jun Piao
     

23 Feb, 2017

1 commit

  • We are in the situation that we have to avoid recursive cluster locking,
    but there is no way to check if a cluster lock has been taken by a process
    already.

    Mostly, we can avoid recursive locking by writing code carefully.
    However, we found that it's very hard to handle the routines that are
    invoked directly by vfs code. For instance:

    const struct inode_operations ocfs2_file_iops = {
        .permission = ocfs2_permission,
        .get_acl    = ocfs2_iop_get_acl,
        .set_acl    = ocfs2_iop_set_acl,
    };

    Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):

    do_sys_open
     may_open
      inode_permission
       ocfs2_permission
        ocfs2_inode_lock()

    Reviewed-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Stephen Rothwell
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Ren
     

13 Dec, 2016

1 commit

  • struct timespec is not y2038 safe. Use time64_t which is y2038 safe to
    represent orphan scan times. time64_t is sufficient here as only the
    seconds delta times are relevant.

    Also use appropriate time functions that return time in time64_t format.
    Time functions now return monotonic time instead of real time as only
    delta scan times are relevant and these values are not persistent across
    reboots.

    The format string for the debug print is still using long as this is
    only the time elapsed since the last scan and long is sufficient to
    represent this value.

    Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
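
    The first two rules in this list drop a shift whose amount was always
    zero. A small sketch of why that is safe; the value 12 for PAGE_SHIFT
    is illustrative only (the real constant is architecture specific), and
    the two function names are hypothetical.

```c
#include <assert.h>

/* Illustrative value only; the real constant is architecture specific. */
#define PAGE_SHIFT        12
#define PAGE_CACHE_SHIFT  PAGE_SHIFT   /* the old macro was a plain alias */

/* Old style: "convert" a page-cache index to a page index. */
static unsigned long old_style_index(unsigned long index)
{
    return index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* New style: the shift amount was always zero, so use the index as is. */
static unsigned long new_style_index(unsigned long index)
{
    assert(PAGE_CACHE_SHIFT == PAGE_SHIFT);
    return index;
}
```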

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is the revert of changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2016

1 commit

  • This patch fixes a deadlock, as follows:

        Node 1                 Node 2                              Node 3
    1)  volumes a and b are    only mount vol a                    only mount vol b
        mounted

    2)                         start to mount b                    start to mount a

    3)                         check hb of Node 3                  check hb of Node 2
                               in vol a, qs_holds++                in vol b, qs_holds++

    4)  -------------------- all nodes' network down --------------------

    5)                         progress of mount b                 the same situation as
                               failed, and then call               Node 2
                               ocfs2_dismount_volume.
                               but the process is hung,
                               since there is a work
                               in ocfs2_wq that cannot be
                               completed. This work is
                               about vol a, because
                               ocfs2_wq is global wq.
                               BTW, this work which is
                               scheduled in ocfs2_wq is
                               ocfs2_orphan_scan_work,
                               and the context in this work
                               needs to take inode lock
                               of orphan_dir, because
                               the lockres owner is Node 1 and
                               all nodes' network has been down
                               at the same time, so it can't
                               get the inode lock.

    6)                         Why can't this node be fenced
                               when network disconnected?
                               Because the process of
                               mount is hung, which leaves
                               qs_holds not equal to 0.

    All works in ocfs2_wq are relative to the super block, so the solution
    is to change ocfs2_wq from global to local. In other words, move it
    into struct ocfs2_super.

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Xue jiufei
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     

05 Sep, 2015

1 commit

  • OCFS2 is often used in high-availability systems. However, ocfs2
    converts the filesystem to read-only at the drop of a hat. This may
    not be necessary, since turning the filesystem read-only would affect
    other running processes as well, decreasing availability.

    This attempt is to add errors=continue, which would return the EIO to
    the calling process and terminate further processing so that the
    filesystem is not corrupted further. However, the filesystem is not
    converted to read-only.

    As a future plan, I intend to create a small utility or extend
    fsck.ocfs2 to fix small errors such as in the inode. The input to the
    utility such as the inode can come from the kernel logs so we don't have
    to schedule a downtime for fixing small-enough errors.

    The patch changes the ocfs2_error to return an error. The error
    returned depends on the mount option set. If none is set, the default
    is to turn the filesystem read-only.
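
    The behaviour described above can be pictured with the sketch below. It
    is not the kernel implementation: the enum, the struct and the function
    name are hypothetical, and returning -EROFS in the default case is an
    assumption. Only "errors=continue returns EIO" and "the default turns
    the filesystem read-only" come from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; the real mount-option flags are not shown here. */
enum errors_mode { ERRORS_REMOUNT_RO, ERRORS_CONTINUE };

struct fs_state {
    enum errors_mode mode;
    bool read_only;
};

/*
 * Report an on-disk error and return the code the caller should propagate.
 * errors=continue: hand -EIO back and leave the filesystem writable.
 * default:         switch to read-only and return -EROFS (assumed code).
 */
static int handle_fs_error(struct fs_state *fs)
{
    assert(fs != NULL);

    if (fs->mode == ERRORS_CONTINUE)
        return -EIO;

    fs->read_only = true;
    return -EROFS;
}
```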

    Perhaps errors=continue is not the best option name. Historically it is
    used for making an attempt to progress in the current process itself.
    Should we call it errors=eio? or errors=killproc? Suggestions/Comments
    welcome.

    Sources are available at:
    https://github.com/goldwynr/linux/tree/error-cont

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     

25 Jun, 2015

1 commit


13 Mar, 2015

1 commit

  • It turns out that making this feature ro_compat isn't quite enough to
    prevent accidental corruption on mount from older kernels. Ocfs2 (like
    other file systems) will process orphaned inodes even when the user mounts
    in 'ro' mode. So for the case of a filesystem not knowing the append_dio
    feature, mounting the filesystem could result in orphaned-for-dio files
    being deleted, which we clearly don't want.

    So instead, turn this into an incompat flag.

    Btw, this is kind of my fault - initially I asked that we add a flag to
    cover the feature and even suggested that we use an ro flag. It wasn't
    until I was looking through our commits for v4.0-rc1 that I realized we
    actually want this to be incompat.
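
    The difference between the two flag classes can be sketched as a mount
    time check. This is an illustration of the usual ro_compat/incompat
    convention, not the ocfs2 code: the struct, the function name and the
    error codes are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct feature_sets {
    uint32_t ro_compat;  /* unknown bits: a read-only mount is still allowed */
    uint32_t incompat;   /* unknown bits: the mount must be refused */
};

/* Returns 0 if the mount may proceed, a negative errno otherwise. */
static int check_features(const struct feature_sets *on_disk,
                          const struct feature_sets *supported,
                          bool mounting_read_only)
{
    assert(on_disk != NULL && supported != NULL);

    if (on_disk->incompat & ~supported->incompat)
        return -EINVAL;     /* never mount, not even 'ro' */

    if (!mounting_read_only &&
        (on_disk->ro_compat & ~supported->ro_compat))
        return -EROFS;      /* only a read-only mount is permitted */

    return 0;
}
```

    With an ro_compat bit, an older kernel that mounts 'ro' passes this
    check and would still process orphans, which is the problem described
    above; with an incompat bit the same mount is refused.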

    Signed-off-by: Mark Fasheh
    Cc: Joseph Qi
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     

17 Feb, 2015

3 commits

  • Introduce a bit OCFS2_FEATURE_RO_COMPAT_APPEND_DIO and check it in the
    write flow. If the bit is not set, fall back to the old way.

    Signed-off-by: Joseph Qi
    Cc: Weiwei Wang
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Xuejiufei
    Cc: alex chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Implement ocfs2_direct_IO_write. Add the inode to the orphan dir first,
    and then delete it once the append O_DIRECT write has finished.

    This is to make sure block allocation and inode size are consistent.

    [akpm@linux-foundation.org: fix it for "block: Add discard flag to blkdev_issue_zeroout() function"]
    Signed-off-by: Joseph Qi
    Cc: Weiwei Wang
    Cc: Junxiao Bi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Xuejiufei
    Cc: alex chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Define two orphan recovery types, which indicate whether the file needs
    to be truncated or not.

    Signed-off-by: Joseph Qi
    Cc: Weiwei Wang
    Cc: Junxiao Bi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Xuejiufei
    Cc: alex chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

11 Feb, 2015

1 commit

  • Add a mount option to support the JBD2 feature
    JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT. When this feature is enabled, the
    journal commit block can be written to disk without waiting for the
    descriptor blocks, which can improve journal commit performance. This
    option will enable 'journal_checksum' internally.

    Using the fs_mark benchmark, journal_async_commit shows an improvement
    of nearly 50% (about 47.5%): the files per second go up from 215.2 to
    317.5.

    test script:
    fs_mark -d /mnt/ocfs2/ -s 10240 -n 1000

    default:
    FSUse%        Count         Size    Files/sec     App Overhead
         0         1000        10240        215.2            17878

    with journal_async_commit option:
    FSUse%        Count         Size    Files/sec     App Overhead
         0         1000        10240        317.5            17881

    Signed-off-by: Alex Chen
    Signed-off-by: Weiwei Wang
    Reviewed-by: Joseph Qi
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    alex chen
     

11 Dec, 2014

1 commit

  • ocfs2_readpages() uses the nonblocking flag to avoid page lock
    inversion. This can trigger a cluster hang, because the flag
    OCFS2_LOCK_UPCONVERT_FINISHING is not cleared if the nonblocking lock
    cannot be granted at once. The flag prevents the dc thread from
    downconverting, so other nodes can never acquire this lockres.

    So we should not set OCFS2_LOCK_UPCONVERT_FINISHING when receiving the
    ast if the nonblocking lock has already returned.

    Signed-off-by: joyce.xue
    Reviewed-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     

05 Jun, 2014

1 commit

  • Revert commit 75f82eaa502c ("ocfs2: fix NULL pointer dereference when
    dismount and ocfs2rec simultaneously") because it may cause a umount
    hang while shutting down the truncate log.

    fix NULL pointer dereference when dismount and ocfs2rec simultaneously

    The situation is as follows:
    ocfs2_dismount_volume
    -> ocfs2_recovery_exit
       -> free osb->recovery_map
    -> ocfs2_truncate_shutdown
       -> lock global bitmap inode
          -> ocfs2_wait_for_recovery
             -> check whether osb->recovery_map->rm_used is zero

    Because osb->recovery_map is already freed, rm_used can hold an
    arbitrary value, which may cause the umount hang.

    To prevent the NULL pointer dereference while getting sys_root_inode, we
    use an osb_tl_disable flag to stop osb_truncate_log_wq from being
    scheduled after truncate log shutdown.
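
    The idea of a "disable" flag that gates scheduling can be sketched as
    follows. This is a single-threaded illustration with hypothetical
    struct and function names; real kernel code would need an atomic flag
    and would queue actual work instead of bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tl_state {
    bool tl_disable;      /* set once the truncate log has been shut down */
    int  scheduled_runs;  /* stand-in for "work was queued" */
};

/* Queue a truncate-log flush unless shutdown has already happened. */
static bool schedule_truncate_log_flush(struct tl_state *st)
{
    assert(st != NULL);

    if (st->tl_disable)
        return false;     /* after shutdown: never queue new work */

    st->scheduled_runs++;
    return true;
}

/* Mark the truncate log as shut down; later schedule requests are ignored. */
static void truncate_log_shutdown(struct tl_state *st)
{
    assert(st != NULL);
    st->tl_disable = true;
}
```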

    Signed-off-by: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     

04 Apr, 2014

3 commits

  • The following case may leave the reference count of a system inode in
    confusion.

    A thread                            B thread
    ocfs2_get_system_file_inode
    ->get_local_system_inode
    ->_ocfs2_get_system_file_inode
                                        because of *arr == NULL,
                                        ocfs2_get_system_file_inode
                                        ->get_local_system_inode
                                        ->_ocfs2_get_system_file_inode
    gets first ref thru
    _ocfs2_get_system_file_inode,
    gets second ref thru igrab and
    set *arr = inode
                                        at the moment, B thread also gets
                                        two refs, so lead to one more
                                        inode ref.

    So add a mutex lock to prevent multiple threads from taking the two
    inode refs at the same time.

    Signed-off-by: jiangyiwen
    Reviewed-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • The following patches are reverted in this patch because these patches
    caused performance regression in the remote unlink() calls.

    ea455f8ab683 - ocfs2: Push out dropping of dentry lock to ocfs2_wq
    f7b1aa69be13 - ocfs2: Fix deadlock on umount
    5fd131893793 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount

    Previous patches in this series removed the possible deadlocks from
    downconvert thread so the above patches shouldn't be needed anymore.

    The regression is caused because these patches delay the iput() in case
    of dentry unlocks. This also delays the unlocking of the open lockres.
    The open lockresource is required to test if the inode can be wiped from
    disk or not. When the deleting node does not get the open lock, it
    marks it as orphan (even though it is not in use by another
    node/process) and causes a journal checkpoint. This delays operations
    following the inode eviction. This also moves the inode to the orphaned
    inode which further causes more I/O and a lot of unnecessary orphans.

    The following script can be used to generate the load causing issues:

    declare -a create
    declare -a remove
    declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
    unique="`mktemp -u XXXXX`"
    script="/tmp/idontknow-${unique}.sh"
    cat <<EOF > "${script}"
    for n in {1..8}; do mkdir -p test/dir\${n}
      eval touch test/dir\${n}/foo{1.."\$1"}
    done
    EOF
    chmod 700 "${script}"

    function fcreate ()
    {
    exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
    }

    function fremove ()
    {
    exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
    }

    function fcp ()
    {
    exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
    }

    echo -------------------------------------------------
    echo "| # files | create #s | copy #s | remove #s |"
    echo -------------------------------------------------
    for ((x=0; x < ${#iterations[*]} ; x++)) do
    create[$x]="`fcreate ${iterations[$x]}`"
    copy[$x]="`fcp ${iterations[$x]}`"
    remove[$x]="`fremove`"
    printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
    done
    rm "${script}"
    echo "------------------------"

    Signed-off-by: Srinivas Eeda
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jan Kara
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • We cannot drop last dquot reference from downconvert thread as that
    creates the following deadlock:

    NODE 1                                      NODE2
    holds dentry lock for 'foo'
    holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                                dquot_initialize(bar)
                                                  ocfs2_dquot_acquire()
                                                    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                                    ...
    downconvert thread (triggered from another
    node or a different process from NODE2)
      ocfs2_dentry_post_unlock()
        ...
        iput(foo)
          ocfs2_evict_inode(foo)
            ocfs2_clear_inode(foo)
              dquot_drop(inode)
                ...
                ocfs2_dquot_release()
                  ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                   - blocks
                                                    finds we need more space in
                                                    quota file
                                                    ...
                                                    ocfs2_extend_no_holes()
                                                      ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                       - deadlocks waiting for
                                                         downconvert thread

    We solve the problem by postponing dropping of the last dquot reference to
    a workqueue if it happens from the downconvert thread.

    Signed-off-by: Jan Kara
    Reviewed-by: Mark Fasheh
    Reviewed-by: Srinivas Eeda
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

22 Jan, 2014

1 commit

  • This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM
    handling up to the times with respect to DLM (>=4.0.1) and corosync
    (2.3.x). AFAIK, cman also is being phased out for a unified corosync
    cluster stack.

    fs/dlm performs all the functions with respect to fencing and node
    management and provides the API's to do so for ocfs2. For all future
    references, DLM stands for fs/dlm code.

    The advantages are:
    + No need to run an additional userspace daemon (ocfs2_controld)
    + No controld device handling and controld protocol
    + Shifting responsibilities of node management to DLM layer

    For backward compatibility, we are keeping the controld handling code.
    Once enough time has passed we can remove a significant portion of the
    code. This was tested by using the kernel with changes on older
    unmodified tools. The kernel used ocfs2_controld as expected, and
    displayed the appropriate warning message.

    This feature requires modification in the userspace ocfs2-tools. The
    changes can be found at https://github.com/goldwynr/ocfs2-tools, branch
    nocontrold. Currently, not many checks are present in the userspace
    code, but that will change soon.

    This patch (of 6):

    Add clustername to cluster connection.

    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     

04 Jul, 2013

1 commit


02 Dec, 2011

1 commit

  • The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit aligned,
    but not 64-bit aligned. The dqc_bitmap is accessed by ocfs2_set_bit(),
    ocfs2_clear_bit(), ocfs2_test_bit(), or ocfs2_find_next_zero_bit(). These
    are wrapper macros for ext2_*_bit() which need to take an unsigned long
    aligned address (though some architectures are able to handle unaligned
    addresses correctly).

    So some 64bit architectures may not be able to access the dqc_bitmap
    correctly.

    This avoids such unaligned access by using other wrapper functions for
    ext2_*_bit(). The code is taken from fs/ext4/mballoc.c, which also needs
    to handle unaligned bitmap access.
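
    The general trick for this kind of wrapper is to round the bitmap
    address down to an unsigned long boundary and add the skipped bytes to
    the bit number. The sketch below shows only that arithmetic; the
    function names are hypothetical, and the byte-wise test helper stands
    in for the word-sized little-endian bit operations a kernel would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Round addr down to an unsigned long boundary and move the skipped
 * bytes into the bit number, so that (returned address, *bit) names the
 * same bit as (addr, original *bit). The caller must make sure the
 * rounded-down address still lies inside the same object.
 */
static unsigned char *correct_addr_and_bit(unsigned char *addr,
                                           unsigned int *bit)
{
    uintptr_t misalign;

    assert(addr != NULL && bit != NULL);

    misalign = (uintptr_t)addr & (sizeof(unsigned long) - 1);
    *bit += (unsigned int)misalign * 8;
    return addr - misalign;
}

/* Little-endian bit numbering: bit n lives in byte n / 8, position n % 8. */
static bool test_bit_le(const unsigned char *base, unsigned int bit)
{
    return (base[bit / 8] >> (bit % 8)) & 1;
}
```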

    Signed-off-by: Akinobu Mita
    Acked-by: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Joel Becker

    Akinobu Mita
     

01 Jun, 2011

1 commit


29 Mar, 2011

1 commit

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (39 commits)
    Treat writes as new when holes span across page boundaries
    fs,ocfs2: Move o2net_get_func_run_time under CONFIG_OCFS2_FS_STATS.
    ocfs2/dlm: Move kmalloc() outside the spinlock
    ocfs2: Make the left masklogs compat.
    ocfs2: Remove masklog ML_AIO.
    ocfs2: Remove masklog ML_UPTODATE.
    ocfs2: Remove masklog ML_BH_IO.
    ocfs2: Remove masklog ML_JOURNAL.
    ocfs2: Remove masklog ML_EXPORT.
    ocfs2: Remove masklog ML_DCACHE.
    ocfs2: Remove masklog ML_NAMEI.
    ocfs2: Remove mlog(0) from fs/ocfs2/dir.c
    ocfs2: remove NAMEI from symlink.c
    ocfs2: Remove masklog ML_QUOTA.
    ocfs2: Remove mlog(0) from quota_local.c.
    ocfs2: Remove masklog ML_RESERVATIONS.
    ocfs2: Remove masklog ML_XATTR.
    ocfs2: Remove masklog ML_SUPER.
    ocfs2: Remove mlog(0) from fs/ocfs2/heartbeat.c
    ocfs2: Remove mlog(0) from fs/ocfs2/slot_map.c
    ...

    Fix up trivial conflict in fs/ocfs2/super.c

    Linus Torvalds
     

24 Mar, 2011

1 commit

  • As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

20 Feb, 2011

1 commit

  • Patch makes use of the hrtimer to track times in ocfs2 lock stats.

    The patch is a bit involved to ensure no additional impact on the memory
    footprint. The size of ocfs2_inode_cache remains 1280 bytes on 32-bit systems.

    A related change was to modify the unit of the max wait time from nanosec to
    microsec allowing us to track max time larger than 4 secs. This change
    necessitated the bumping of the output version in the debugfs file,
    locking_state, from 2 to 3.
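
    The "4 secs" limit follows from simple arithmetic if the max wait time
    is held in a 32-bit counter, which is an inference from the text rather
    than something the commit states. A small sketch with a hypothetical
    helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest whole number of seconds a 32-bit counter can represent for a
 * given resolution (ticks per second).
 */
static uint32_t max_seconds_32bit(uint32_t ticks_per_second)
{
    assert(ticks_per_second > 0);
    return UINT32_MAX / ticks_per_second;
}
```

    With nanosecond resolution the counter tops out at 4 whole seconds;
    with microsecond resolution it reaches 4294 seconds, a little over 71
    minutes.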

    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
     

16 Dec, 2010

1 commit

  • Recently, one of our colleagues ran into a problem: if we repeatedly
    write and delete a 32 MB file, we eventually get ENOSPC. The
    corresponding bug is 1288:
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1288

    The real problem is that although we have freed the clusters, they sit
    in the truncate log, where they are accumulated so that they can be
    freed in one go.

    This patch tries to resolve it. If we see -ENOSPC in
    ocfs2_write_begin_no_lock, we check whether the truncate log has enough
    clusters for our needs; if so, we flush the truncate log at that point
    and try again. This method was inspired by Mark Fasheh. Thanks.
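
    The retry logic described above can be sketched in a few lines. This is
    a toy model, not the ocfs2 code: the struct, its two counters and all
    function names are hypothetical stand-ins for the allocator and the
    truncate log.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct alloc_state {
    uint32_t free_clusters;          /* clusters available right now */
    uint32_t truncate_log_clusters;  /* freed, but still parked in the log */
};

/* Plain allocation attempt: fails with -ENOSPC when space is short. */
static int try_alloc(struct alloc_state *st, uint32_t want)
{
    if (st->free_clusters < want)
        return -ENOSPC;
    st->free_clusters -= want;
    return 0;
}

/* Return every cluster parked in the truncate log to the free pool. */
static void flush_truncate_log(struct alloc_state *st)
{
    st->free_clusters += st->truncate_log_clusters;
    st->truncate_log_clusters = 0;
}

/* Allocate; on -ENOSPC flush the log and retry once if that can help. */
static int alloc_with_retry(struct alloc_state *st, uint32_t want)
{
    int ret;

    assert(st != NULL);

    ret = try_alloc(st, want);
    if (ret != -ENOSPC)
        return ret;

    /* Only flush when the log alone holds enough clusters for the request. */
    if (st->truncate_log_clusters < want)
        return -ENOSPC;

    flush_truncate_log(st);
    return try_alloc(st, want);
}
```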

    Cc: Mark Fasheh
    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     

19 Nov, 2010

1 commit

  • Commit 1c66b360fe262 (Change some lock status member in ocfs2_lock_res
    to char.) states that these fields need to be signed due to comparison
    to -1, but only changed the type from unsigned char to char. However,
    whether plain char is signed or unsigned depends on the compiler and
    its options. Change these fields to signed char so the code will work
    with all compilers.
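
    The effect of the field type on a comparison with -1 can be shown with
    a short sketch. The two structs and function names are hypothetical;
    only the field name l_level is borrowed from the commit message.

```c
#include <assert.h>

struct lock_levels_unsigned { unsigned char l_level; };
struct lock_levels_signed   { signed char   l_level; };

/* Returns 1 when a field initialised with -1 still compares equal to -1. */
static int unsigned_field_keeps_minus_one(void)
{
    struct lock_levels_unsigned r;
    int stored;

    r.l_level = (unsigned char)-1;  /* wraps to UCHAR_MAX, 255 with 8-bit chars */
    stored = r.l_level;             /* promoted to int as a positive value */
    return stored == -1;
}

static int signed_field_keeps_minus_one(void)
{
    struct lock_levels_signed r;
    int stored;

    r.l_level = -1;
    assert(r.l_level < 0);
    stored = r.l_level;             /* promoted to int as -1 */
    return stored == -1;
}
```

    A plain char field behaves like one or the other depending on the
    compiler, which is why spelling out signed char is the portable choice.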

    Signed-off-by: Milton Miller
    Signed-off-by: Joel Becker

    Milton Miller
     

13 Nov, 2010

1 commit

  • Commit 83fd9c7 changed l_level, l_requested and l_blocking of
    ocfs2_lock_res from int to unsigned char. But they are initialized to
    -1 (in ocfs2_lock_res_init_common), which corresponds to 255 for an
    unsigned char. So the whole dlm lock mechanism doesn't work now, which
    is a disaster for ocfs2.

    Cc: Goldwyn Rodrigues
    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     

16 Oct, 2010

1 commit


12 Oct, 2010

1 commit

  • Currently, the default behavior of O_DIRECT writes allows concurrent
    writing among nodes to the same file, with no cluster coherency
    guaranteed (no EX lock held). This can leave stale data in the cache
    for buffered reads on other nodes.

    The new mount option offers a choice between two behaviors for
    O_DIRECT writes:

    * coherency=full, the default value, disallows concurrent O_DIRECT
      writes by taking EX locks.

    * coherency=buffered allows concurrent O_DIRECT writes among nodes
      without an EX lock, which gains high performance at the risk of
      getting stale data on other nodes.
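
    How the two option values map to locking behaviour can be sketched as
    below. The enum and function names are hypothetical; only the option
    values "full" and "buffered" and their effect on the EX lock come from
    the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum coherency_mode { COHERENCY_FULL, COHERENCY_BUFFERED };

/* Parse the option value; returns false for anything unrecognised. */
static bool parse_coherency(const char *value, enum coherency_mode *mode)
{
    assert(value != NULL && mode != NULL);

    if (strcmp(value, "full") == 0)
        *mode = COHERENCY_FULL;
    else if (strcmp(value, "buffered") == 0)
        *mode = COHERENCY_BUFFERED;
    else
        return false;

    return true;
}

/* Only coherency=full serialises O_DIRECT writers with an EX lock. */
static bool direct_write_takes_ex_lock(enum coherency_mode mode)
{
    return mode == COHERENCY_FULL;
}
```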

    Signed-off-by: Tristan Ye
    Signed-off-by: Joel Becker

    Tristan Ye
     

10 Oct, 2010

1 commit

  • OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
    both userspace and o2cb cluster stacks. It also allows us to extend cluster
    info to include stack flags.

    This patch also adds stackflags to sb->s_clusterinfo. It also introduces a
    clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
    global heartbeat mode.

    This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
    clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.

    Signed-off-by: Sunil Mushran

    Sunil Mushran
     

08 Oct, 2010

1 commit


10 Sep, 2010

1 commit

  • During orphan scan, if we are slot 0, and we are replaying
    orphan_dir:0001, the general process is that for every file
    in this dir:
    1. we will iget orphan_dir:0001, since there is no inode for it.
    we will have to create an inode and read it from the disk.
    2. do the normal work, such as delete_inode and remove it from
    the dir if it is allowed.
    3. call iput orphan_dir:0001 when we are done. In this case,
    since we have no dcache for this inode, i_count will
    reach 0, and VFS will have to call clear_inode and in
    ocfs2_clear_inode we will checkpoint the inode which will let
    ocfs2_cmt and journald begin to work.
    4. We loop back to 1 for the next file.

    So for every deleted file, we actually have to read the orphan dir
    from the disk and checkpoint the journal. This is very time consuming
    and causes a lot of journal checkpoint I/O.
    A better solution is that we can have another reference for these
    inodes in ocfs2_super. So if there is no other race among
    nodes(which will let dlmglue to checkpoint the inode), for step 3,
    clear_inode won't be called and for step 1, we may only need to
    read the inode for the 1st time. This is a big win for us.

    So this patch will try to cache system inodes of other slots so
    that we will have one more reference for these inodes and avoid
    the extra inode read and journal checkpoint.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma