14 Aug, 2010

1 commit

  • * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
    O2net: Disallow o2net accept connection request from itself.
    ocfs2/dlm: remove potential deadlock -V3
    ocfs2/dlm: avoid incorrect bit set in refmap on recovery master
    Fix the nested PR lock calling issue in ACL
    ocfs2: Count more refcount records in file system fragmentation.
    ocfs2 fix o2dlm dlm run purgelist (rev 3)
    ocfs2/dlm: fix a dead lock
    ocfs2: do not overwrite error codes in ocfs2_init_acl

    Linus Torvalds
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

6 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • ... and let iput_final() do the actual eviction or retention

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Make sure we check the truncate constraints early on in ->setattr by adding
    those checks to inode_change_ok. Also clean up and document inode_change_ok
    to make this obvious.

    As a fallout we don't have to call inode_newsize_ok from simple_setsize and
    simplify it down to a truncate_setsize which doesn't return an error. This
    simplifies a lot of setattr implementations and means we use truncate_setsize
    almost everywhere. Get rid of fat_setsize now that it's trivial and mark
    ext2_setsize static to make the calling convention obvious.

    Keep the inode_newsize_ok in vmtruncate for now as all callers need an
    audit for its removal anyway.

    Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
    needs a deeper audit, but that is left for later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Replace inode_setattr with opencoded variants of it in all callers. This
    moves the remaining call to vmtruncate into the filesystem methods where it
    can be replaced with the proper truncate sequence.

    In a few cases it was obvious that we would never end up calling vmtruncate
    so it was left out in the opencoded variant:

    spufs: explicitly checks for ATTR_SIZE earlier
    btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
    ufs: contains an opencoded simple_setattr + truncate that sets the filesize just above

    In addition to that, ncpfs called inode_setattr with handcrafted iattrs,
    which allowed trimming down the opencoded variant.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Move the call to vmtruncate that gets rid of excessive blocks to the callers
    in preparation for the new truncate calling sequence. This was only done
    for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
    was not needed anyway. Get rid of blockdev_direct_IO_no_locking and
    its _newtrunc variant while at it, as just opencoding the two additional
    parameters is shorter than the name suffix.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

08 Aug, 2010

9 commits

  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
    ext4: Adding error check after calling ext4_mb_regular_allocator()
    ext4: Fix dirtying of journalled buffers in data=journal mode
    ext4: re-inline ext4_rec_len_(to|from)_disk functions
    jbd2: Remove t_handle_lock from start_this_handle()
    jbd2: Change j_state_lock to be a rwlock_t
    jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
    ext4: Add mount options in superblock
    ext4: force block allocation on quota_off
    ext4: fix freeze deadlock under IO
    ext4: drop inode from orphan list if ext4_delete_inode() fails
    ext4: check to make make sure bd_dev is set before dereferencing it
    jbd2: Make barrier messages less scary
    ext4: don't print scary messages for allocation failures post-abort
    ext4: fix EFBIG edge case when writing to large non-extent file
    ext4: fix ext4_get_blocks references
    ext4: Always journal quota file modifications
    ext4: Fix potential memory leak in ext4_fill_super
    ext4: Don't error out the fs if the user tries to make a file too big
    ext4: allocate stripe-multiple IOs on stripe boundaries
    ext4: move aio completion after unwritten extent conversion
    ...

    Fix up conflicts in fs/ext4/inode.c as per Ted.

    Fix up xfs conflicts as per earlier xfs merge.

    Linus Torvalds
     
  • Currently, o2net_accept_one() is allowed to accept a connection from the
    listening node itself. Such a fake connection will never be successfully
    established because no handshake follows, and it later ends up
    triggering the connecting worker in a loop.

    Fix this by treating such a connection request as 'invalid',
    since a node never needs to request a connection to itself
    in an OCFS2 cluster.

    The fix doesn't hurt a user's scan for the o2net listener; it still gets a
    successful connection from userspace.

    Signed-off-by: Tristan Ye
    Acked-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Tristan Ye
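
The check described in this commit can be pictured roughly as follows. This is a minimal sketch in plain C, not the o2net code: `struct node_info` and `accept_allowed()` are hypothetical names, and the node number stands in for whatever identity the real code compares.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cluster node; not an o2net structure. */
struct node_info {
    int num;    /* cluster-wide node number */
};

/*
 * Decide whether an incoming connection request should be accepted.
 * A request whose sender is the listening node itself is treated as
 * invalid, because a node never needs to connect to itself.
 */
static bool accept_allowed(const struct node_info *local,
                           const struct node_info *remote)
{
    if (local == NULL || remote == NULL)
        return false;
    return remote->num != local->num;
}
```

Rejecting the request at accept time means no half-open connection is left behind to retrigger the connecting worker.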
     
  • When we need to take both dlm_domain_lock and dlm->spinlock, we should take
    them in this order: dlm_domain_lock, then dlm->spinlock.

    There are paths that disobey this order: dlm_run_purge_list calls
    dlm_lockres_put() with dlm->spinlock held. dlm_lockres_put() calls dlm_put()
    when the ref is dropped, and dlm_put() takes dlm_domain_lock.

    Fix:
    Don't grab/put the dlm when initialising/releasing a lockres.
    That grab is not required because we don't call dlm_unregister_domain()
    based on refcount.

    Signed-off-by: Wengang Wang
    Cc: stable@kernel.org
    Signed-off-by: Joel Becker

    Wengang Wang
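
The ordering rule in this commit can be modelled without any real locks. The sketch below is an assumption-laden toy: `enum lock_rank`, `struct lock_tracker` and both functions are hypothetical, and the ranks merely mirror the stated order (dlm_domain_lock before dlm->spinlock).

```c
#include <stdbool.h>

/* Hypothetical lock ranks: a lower rank must be taken first. */
enum lock_rank {
    RANK_NONE = 0,
    RANK_DOMAIN_LOCK = 1,    /* models dlm_domain_lock */
    RANK_DLM_SPINLOCK = 2    /* models dlm->spinlock */
};

#define MAX_HELD 4

struct lock_tracker {
    enum lock_rank held[MAX_HELD];
    int depth;
};

/* Returns false if taking `rank` now would violate the ordering rule. */
static bool tracker_acquire(struct lock_tracker *t, enum lock_rank rank)
{
    if (t->depth >= MAX_HELD)
        return false;
    if (t->depth > 0 && t->held[t->depth - 1] >= rank)
        return false;    /* a higher or equal rank is already held */
    t->held[t->depth++] = rank;
    return true;
}

/* Drop the most recently taken lock. */
static void tracker_release(struct lock_tracker *t)
{
    if (t->depth > 0)
        t->depth--;
}
```

In this model, the purge path described above corresponds to asking for `RANK_DOMAIN_LOCK` while `RANK_DLM_SPINLOCK` is held, which the tracker refuses.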
     
  • In the following situation, an incorrect bit remains in the refmap on the
    recovery master. Eventually the recovery master fails to purge the lockres
    because of that incorrect bit.

    1) node A has no interest in lockres A any longer, so it is purging it.
    2) the owner of lockres A is node B, so node A is sending a de-ref message
    to node B.
    3) at this time, node B crashes. Node C becomes the recovery master. It recovers
    lockres A (because the master is the dead node B).
    4) node A migrates lockres A to node C with a refbit there.
    5) node A fails to send the de-ref message to node B because it crashed. The failure
    is ignored. No other action is taken for lockres A any more.

    Normally, re-sending the deref message to the recovery master would fix it. However,
    ignoring the failure of the deref to the original master and not recovering the lockres
    to the recovery master has the same effect, and the latter is simpler.

    Signed-off-by: Wengang Wang
    Acked-by: Srinivas Eeda
    Cc: stable@kernel.org
    Signed-off-by: Joel Becker

    Wengang Wang
     
  • Hi,

    Thanks a lot for all the review and comments so far;) I'd like to send
    the improved (V4) version of this patch.

    This patch fixes a deadlock in OCFS2 ACL. We found this bug in an OCFS2
    and Samba integration scenario; the symptom is that several smbd
    processes hang under heavy workload. We eventually found that it
    is the nested PR lock calling that leads to this deadlock:

    node1        node2
                 gr PR
                   |
                   V
    PR(EX)---> BAST:OCFS2_LOCK_BLOCKED
                   |
                   V
                 rq PR
                   |
                   V
                 wait=1

    After requesting the 2nd PR lock, the process "smbd" went into D
    state. It can only be woken up when the 1st PR lock's RO holder equals
    zero. There should be an ocfs2_inode_unlock in the calling path later
    on, which can decrement the RO holder. But since it has been in
    uninterruptible sleep, the unlock function has no chance to be called.

    The related stack trace is:
    smbd D ffff8800013d0600 0 9522 5608 0x00000000
    ffff88002ca7fb18 0000000000000282 ffff88002f964500 ffff88002ca7fa98
    ffff8800013d0600 ffff88002ca7fae0 ffff88002f964340 ffff88002f964340
    ffff88002ca7ffd8 ffff88002ca7ffd8 ffff88002f964340 ffff88002f964340
    Call Trace:
    [] schedule_timeout+0x175/0x210
    [] wait_for_common+0xf0/0x210
    [] __ocfs2_cluster_lock+0x3b9/0xa90 [ocfs2]
    [] ocfs2_inode_lock_full_nested+0x255/0xdb0 [ocfs2]
    [] ocfs2_get_acl+0x69/0x120 [ocfs2]
    [] ocfs2_check_acl+0x28/0x80 [ocfs2]
    [] acl_permission_check+0x57/0xb0
    [] generic_permission+0x1d/0xc0
    [] ocfs2_permission+0x10a/0x1d0 [ocfs2]
    [] inode_permission+0x45/0x100
    [] sys_chdir+0x53/0x90
    [] system_call_fastpath+0x16/0x1b
    [] 0x7f34a4ef6927

    For details, please see:
    https://bugzilla.novell.com/show_bug.cgi?id=614332 and
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1278

    Signed-off-by: Jiaju Zhang
    Acked-by: Mark Fasheh
    Cc: stable@kernel.org
    Signed-off-by: Joel Becker

    Jiaju Zhang
     
  • The refcount record calculation in ocfs2_calc_refcount_meta_credits
    is too optimistic: it assumes that we can always allocate contiguous clusters
    and handle an already existing refcount rec as a whole. In fact,
    because of file system fragmentation, we may have to split
    a refcount record into 3 parts during the transaction. So consider
    the worst case in the record calculation.

    Cc: stable@kernel.org
    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
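
Why the worst case is 3 parts can be seen from a small model. This is a sketch under stated assumptions, not the ocfs2 credit calculation: `struct rec` and `records_after_split()` are hypothetical, and a record is reduced to a run of clusters.

```c
#include <stdint.h>

/* Hypothetical model of one refcount record: a run of clusters. */
struct rec {
    uint64_t start;    /* first cluster */
    uint64_t len;      /* number of clusters */
};

/*
 * Count the records left after the sub-range [start, start + len) of `r`
 * gets a different refcount. Assumes len > 0 and that the sub-range lies
 * entirely inside `r`.
 */
static int records_after_split(const struct rec *r, uint64_t start,
                               uint64_t len)
{
    int parts = 1;                              /* the modified middle part */

    if (start > r->start)
        parts++;                                /* untouched head */
    if (start + len < r->start + r->len)
        parts++;                                /* untouched tail */
    return parts;
}
```

Touching a range strictly inside a record leaves a head, a middle and a tail, which is the case the commit asks the calculation to budget for.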
     
  • This patch fixes two problems in dlm_run_purgelist:

    1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
    the same lockres instead of trying the next lockres.

    2. When a lockres is found unused, dlm_run_purgelist releases the lockres spinlock
    before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
    The spinlock is reacquired, but in this window the lockres can get reused. This leads
    to a BUG.

    This patch modifies dlm_run_purgelist to skip a lockres if it's in use and purge
    the next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
    lockres spinlock, protecting it from getting reused.

    Signed-off-by: Srinivas Eeda
    Acked-by: Sunil Mushran
    Cc: stable@kernel.org
    Signed-off-by: Joel Becker

    Srinivas Eeda
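
The two behaviours this commit describes (skip busy entries, flag before unlocking) can be sketched as a plain loop. Everything here is a hypothetical model: `struct res`, `RES_DROPPING_REF` and `run_purge_list()` are not the o2dlm definitions, and the locking is only indicated in comments.

```c
#include <stdbool.h>
#include <stddef.h>

#define RES_DROPPING_REF 0x1u    /* models DLM_LOCK_RES_DROPPING_REF */

/* Hypothetical, heavily reduced lock resource. */
struct res {
    bool in_use;
    unsigned int state;
    bool purged;
};

/*
 * Walk the purge list once and return the number of resources purged.
 * A resource that is in use is skipped, so the loop moves on to the next
 * entry instead of retrying the same one. An unused resource is flagged
 * as dropping its reference while its lock would still be held, and is
 * purged only afterwards.
 */
static size_t run_purge_list(struct res *list, size_t n)
{
    size_t purged = 0;

    for (size_t i = 0; i < n; i++) {
        struct res *r = &list[i];

        /* the per-resource lock would be taken here */
        if (r->in_use) {
            /* unlock and try the next resource */
            continue;
        }
        r->state |= RES_DROPPING_REF;    /* set before the lock is dropped */
        /* the per-resource lock would be released here */
        r->purged = true;
        purged++;
    }
    return purged;
}
```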
     
  • When we have to take both dlm->master_lock and lockres->spinlock,
    take them in this order:

    lockres->spinlock and then dlm->master_lock.

    The patch fixes a violation of the rule.
    We can simply move taking dlm->master_lock to where we have dropped res->spinlock,
    since we don't need master_lock's protection when we access res->state and free
    mle memory.

    Signed-off-by: Wengang Wang
    Cc: stable@kernel.org
    Signed-off-by: Joel Becker

    Wengang Wang
     
  • Setting the acl while creating a new inode depends on
    the error codes of posix_acl_create_masq. This patch fixes
    an issue where those error codes were overwritten.

    Reported-by: Pawel Zawora
    Cc: [ .33, .34 ]
    Signed-off-by: Tiger Yang
    Signed-off-by: Joel Becker

    Tiger Yang
     

04 Aug, 2010

2 commits


27 Jul, 2010

2 commits

  • Filesystems with unwritten extent support must not complete an AIO request
    until the transaction to convert the extent has been committed. That means
    the aio_complete calls need to be moved into the ->end_io callback so
    that the filesystem can control exactly when to call it.

    This makes a bit of a mess out of dio_complete and makes the ->end_io callback
    prototype even more complicated.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Christoph Hellwig
     
  • Filesystems with unwritten extent support must not complete an AIO request
    until the transaction to convert the extent has been committed. That means
    the aio_complete calls need to be moved into the ->end_io callback so
    that the filesystem can control exactly when to call it.

    This makes a bit of a mess out of dio_complete and makes the ->end_io callback
    prototype even more complicated.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

20 Jul, 2010

1 commit


19 Jul, 2010

1 commit

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
    ocfs2: Silence gcc warning in ocfs2_write_zero_page().
    jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
    ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
    ocfs2: Don't duplicate pages past i_size during CoW.
    ocfs2: tighten up strlen() checking
    ocfs2: Make xattr reflink work with new local alloc reservation.
    ocfs2: make xattr extension work with new local alloc reservation.
    ocfs2: Remove the redundant cpu_to_le64.
    ocfs2/dlm: don't access beyond bitmap size
    ocfs2: No need to zero pages past i_size.
    ocfs2: Zero the tail cluster when extending past i_size.
    ocfs2: When zero extending, do it by page.
    ocfs2: Limit default local alloc size within bitmap range.
    ocfs2: Move orphan scan work to ocfs2_wq.
    fs/ocfs2/dlm: Add missing spin_unlock

    Linus Torvalds
     

17 Jul, 2010

1 commit


16 Jul, 2010

3 commits

  • OCFS2 uses t_commit trigger to compute and store checksum of the just
    committed blocks. When a buffer has b_frozen_data, checksum is computed
    for it instead of b_data but this can result in an old checksum being
    written to the filesystem in the following scenario:

    1) transaction1 is opened
    2) handle1 is opened
    3) journal_access(handle1, bh)
    - This sets jh->b_transaction to transaction1
    4) modify(bh)
    5) journal_dirty(handle1, bh)
    6) handle1 is closed
    7) start committing transaction1, opening transaction2
    8) handle2 is opened
    9) journal_access(handle2, bh)
    - This copies off b_frozen_data to make it safe for transaction1 to commit.
    jh->b_next_transaction is set to transaction2.
    10) jbd2_journal_write_metadata() checksums b_frozen_data
    11) the journal correctly writes b_frozen_data to the disk journal
    12) handle2 is closed
    - There was no dirty call for the bh on handle2, so it is never queued for
    any more journal operation
    13) Checkpointing finally happens, and it just spools the bh via normal buffer
    writeback. This will write b_data, which was never triggered on and thus
    contains a wrong (old) checksum.

    This patch fixes the problem by calling the trigger at the moment data is
    frozen for journal commit - i.e., either when b_frozen_data is created by
    do_get_write_access or just before we write a buffer to the log if
    b_frozen_data does not exist. We also rename the trigger to t_frozen as
    that better describes when it is called.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Jan Kara
     
  • For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
    before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
    dlm_migration_can_proceed() for that purpose. However, if the node is
    down, dlm_migration_can_proceed() will also return "go ahead". In this
    rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
    the BUG_ON() that trips over this condition.

    Signed-off-by: Wengang Wang
    Signed-off-by: Joel Becker

    Wengang Wang
     
  • During CoW, the pages after i_size don't contain valid data, so there's
    no need to read and duplicate them.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     

13 Jul, 2010

6 commits

  • This function is only called from one place and it's like this:
    dlm_register_domain(conn->cc_name, dlm_key, &fs_version);

    The "conn->cc_name" is 64 characters long. If strlen(conn->cc_name)
    were equal to O2NM_MAX_NAME_LEN (64) that would be a bug because
    strlen() doesn't count the terminating NUL character.

    In fact, if you look how O2NM_MAX_NAME_LEN is used, it mostly describes
    64 character buffers. The only exception is nd_name from struct
    o2nm_node.

    Anyway I looked into it and in this case the domain string comes from
    osb->uuid_str in ocfs2_setup_osb_uuid(). That's 32 characters plus a NUL
    which easily fits into O2NM_MAX_NAME_LEN. This patch doesn't change how
    the code works, but I think it makes the code a little cleaner.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Joel Becker

    Dan Carpenter
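
The off-by-one this commit talks about is easy to show in isolation. The sketch below assumes a 64-byte buffer as quoted for O2NM_MAX_NAME_LEN; `MAX_NAME_LEN` and `name_fits()` are hypothetical names, not the o2dlm check itself.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NAME_LEN 64    /* mirrors the value quoted for O2NM_MAX_NAME_LEN */

/*
 * Returns true if `name` fits into a MAX_NAME_LEN-byte buffer including
 * its terminating NUL. strlen() does not count the terminator, so a
 * length equal to MAX_NAME_LEN is already one byte too long.
 */
static bool name_fits(const char *name)
{
    return strlen(name) < MAX_NAME_LEN;
}
```

A 32-character UUID string passes comfortably, while a 64-character name is rejected even though its visible length equals the buffer size.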
     
  • The new reservation code in local alloc has added the limitation
    that the caller must handle the case where the local alloc
    doesn't give us enough contiguous clusters. This breaks the old
    xattr reflink code.

    So this patch updates the xattr reflink code so that it can
    handle the case where local alloc gives us one cluster at a time.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     
  • The old ocfs2_xattr_extent_allocation is too optimistic about
    the clusters we can get. If the file system is
    too fragmented, ocfs2_add_clusters_in_btree will return
    -EAGAIN and we need to allocate clusters once again.

    So this patch changes it to a while loop so that we can allocate
    clusters until we reach clusters_to_add.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Tao Ma
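
The retry loop this commit introduces has roughly the following shape. This is a minimal sketch with a made-up allocator: `alloc_some()` and `extend_by()` are hypothetical, and `max_contig` simply models how many contiguous clusters a fragmented file system hands out per attempt.

```c
#include <stdint.h>

/*
 * Hypothetical allocator: grants at most `max_contig` clusters per call,
 * mimicking a fragmented file system. Returns the number granted.
 */
static uint32_t alloc_some(uint32_t wanted, uint32_t max_contig)
{
    return wanted < max_contig ? wanted : max_contig;
}

/*
 * Keep allocating until `clusters_to_add` clusters have been obtained.
 * Returns the number of allocation rounds needed, or -1 if the allocator
 * makes no progress.
 */
static int extend_by(uint32_t clusters_to_add, uint32_t max_contig)
{
    int rounds = 0;

    while (clusters_to_add > 0) {
        uint32_t got = alloc_some(clusters_to_add, max_contig);

        if (got == 0)
            return -1;              /* nothing granted, give up */
        clusters_to_add -= got;     /* partial grant: go around again */
        rounds++;
    }
    return rounds;
}
```

A single-shot version would stop after the first partial grant; the loop keeps going until the full request is satisfied.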
     
  • In ocfs2_block_group_alloc, we set c_blkno from bg->bg_blkno.
    But bg->bg_blkno has already been converted to little endian
    in ocfs2_block_group_fill. So remove the extra cpu_to_le64.

    Reported-by: Marcos Matsunaga
    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
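
A redundant conversion like this goes unnoticed on little-endian machines, where cpu_to_le64() is a no-op, but on a big-endian CPU it is a byte swap, and swapping an already converted value swaps it back to CPU order. The sketch below only illustrates that property; `swap64()` is a hypothetical helper standing in for what the conversion does on big-endian hardware.

```c
#include <stdint.h>

/* Unconditional 64-bit byte swap (models cpu_to_le64() on a big-endian CPU). */
static uint64_t swap64(uint64_t v)
{
    uint64_t out = 0;

    for (int i = 0; i < 8; i++) {
        out = (out << 8) | (v & 0xff);    /* move the lowest byte to the top */
        v >>= 8;
    }
    return out;
}
```

Because `swap64(swap64(x)) == x`, converting a field twice stores the CPU-order value on disk instead of the little-endian one.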
     
  • dlm->recovery_map is defined as
    unsigned long recovery_map[BITS_TO_LONGS(O2NM_MAX_NODES)];

    We should treat O2NM_MAX_NODES as the bitmap size in bits.
    This patch fixes a bit operation that takes O2NM_MAX_NODES + 1 as the bitmap size.

    Signed-off-by: Wengang Wang
    Signed-off-by: Joel Becker

    Wengang Wang
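
The effect of passing the wrong size can be shown with a hand-rolled bit search. This is a sketch, not the kernel bitmap API: `NBITS_PER_LONG`, `NBITS_TO_LONGS`, `next_set_bit()` and the value 64 for `MAX_NODES` are all assumptions made for illustration.

```c
#include <limits.h>
#include <stddef.h>

#define NBITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define NBITS_TO_LONGS(n) (((n) + NBITS_PER_LONG - 1) / NBITS_PER_LONG)
#define MAX_NODES         64    /* hypothetical stand-in for O2NM_MAX_NODES */

/*
 * Return the index of the first set bit at or after `from`, or `size` if
 * there is none. Only bits in [0, size) are examined, so `size` must not
 * exceed the number of bits the array was declared with.
 */
static size_t next_set_bit(const unsigned long *map, size_t size, size_t from)
{
    for (size_t i = from; i < size; i++) {
        if ((map[i / NBITS_PER_LONG] >> (i % NBITS_PER_LONG)) & 1UL)
            return i;
    }
    return size;
}
```

With an array declared as `unsigned long map[NBITS_TO_LONGS(MAX_NODES)]`, calling the search with `MAX_NODES + 1` would make it examine bit 64, which lies past the end of the array.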
     
  • When ocfs2 fills a hole, it does so by allocating clusters. When a
    cluster is larger than the write, ocfs2 must zero the portions of the
    cluster outside of the write. If the clustersize is smaller than a
    pagecache page, this is handled by the normal pagecache mechanisms, but
    when the clustersize is larger than a page, ocfs2's write code will zero
    the pages adjacent to the write. This makes sure the entire cluster is
    zeroed correctly.

    Currently ocfs2 behaves exactly the same when writing past i_size.
    However, this means ocfs2 is writing zeroed pages for portions of a new
    cluster that are beyond i_size. The page writeback code isn't expecting
    this. It treats all pages past the one containing i_size as left behind
    due to a previous truncate operation.

    Thankfully, ocfs2 calculates the number of pages it will be working on
    up front. The rest of the write code merely honors the original
    calculation. We can simply trim the number of pages to only cover the
    actual file data.

    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Joel Becker
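
The trimming idea can be illustrated with simple page arithmetic. The sketch below is a simplified model and not the ocfs2 write path: the 4096-byte page size, `pages_spanned()` and `pages_for_write()` are assumptions, and `len` is taken to be greater than zero.

```c
#include <stdint.h>

#define PAGE_SZ 4096u    /* assumed page size */

/* Number of pages needed to cover the byte range [start, end). */
static uint32_t pages_spanned(uint64_t start, uint64_t end)
{
    if (end <= start)
        return 0;
    return (uint32_t)((end - 1) / PAGE_SZ - start / PAGE_SZ + 1);
}

/*
 * Pages touched by a write of `len` bytes at `pos` that allocates whole
 * clusters of `cluster_size` bytes. Without trimming, every page of the
 * touched clusters is covered. With trimming, pages past the end of the
 * file data (the larger of `old_size` and pos + len) are left out.
 */
static uint32_t pages_for_write(uint64_t pos, uint64_t len,
                                uint64_t cluster_size, uint64_t old_size,
                                int trim)
{
    uint64_t c_start = (pos / cluster_size) * cluster_size;
    uint64_t c_end = ((pos + len + cluster_size - 1) / cluster_size)
                     * cluster_size;
    uint64_t data_end = pos + len > old_size ? pos + len : old_size;

    if (trim && c_end > data_end)
        c_end = data_end;
    return pages_spanned(c_start, c_end);
}
```

For a 100-byte write at offset 0 into an empty file with 32K clusters, the untrimmed count is 8 pages and the trimmed count is 1.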
     

09 Jul, 2010

2 commits

  • ocfs2's allocation unit is the cluster. This can be larger than a block
    or even a memory page. This means that a file may have many blocks in
    its last extent that are beyond the block containing i_size. There also
    may be more unwritten extents after that.

    When ocfs2 grows a file, it zeros the entire cluster in order to ensure
    future i_size growth will see cleared blocks. Unfortunately,
    block_write_full_page() drops the pages past i_size. This means that
    ocfs2 is actually leaking garbage data into the tail end of that last
    cluster. This is a bug.

    We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
    when a write or truncate is past i_size. They will use
    ocfs2_zero_extend() to ensure the data is properly zeroed.

    Older versions of ocfs2_zero_extend() simply zeroed every block between
    i_size and the zeroing position. This presumes three things:

    1) There is allocation for all of these blocks.
    2) The extents are not unwritten.
    3) The extents are not refcounted.

    (1) and (2) hold true for non-sparse filesystems, which used to be the
    only users of ocfs2_zero_extend(). (3) is another bug.

    Since we're now using ocfs2_zero_extend() for sparse filesystems as
    well, we teach ocfs2_zero_extend() to check every extent between
    i_size and the zeroing position. If the extent is unwritten, it is
    ignored. If it is refcounted, it is CoWed. Then it is zeroed.

    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Joel Becker
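
The per-extent decision described in the last paragraph of this commit can be written as a small classifier. This is only a sketch of the stated rule: `struct extent`, `enum zero_action` and `classify_extent()` are hypothetical, and holes with no allocation are not modelled.

```c
#include <stdbool.h>

/* What to do with one extent between i_size and the zeroing position. */
enum zero_action {
    SKIP_EXTENT,      /* unwritten: ignored */
    ZERO_EXTENT,      /* plain written extent: zeroed */
    COW_THEN_ZERO     /* refcounted: CoWed first, then zeroed */
};

/* Hypothetical, reduced view of an extent's flags. */
struct extent {
    bool unwritten;
    bool refcounted;
};

static enum zero_action classify_extent(const struct extent *e)
{
    if (e->unwritten)
        return SKIP_EXTENT;
    if (e->refcounted)
        return COW_THEN_ZERO;
    return ZERO_EXTENT;
}
```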
     
  • ocfs2_zero_extend() does its zeroing block by block, but it calls a
    function named ocfs2_write_zero_page(). Let's have
    ocfs2_write_zero_page() handle the page level. From
    ocfs2_zero_extend()'s perspective, it is now page-at-a-time.

    Signed-off-by: Joel Becker
    Cc: stable@kernel.org

    Joel Becker
     

28 Jun, 2010

1 commit

  • Implicit slab.h inclusion via percpu.h is about to go away. Make sure
    gfp.h or slab.h is included as necessary.

    Signed-off-by: Tejun Heo
    Cc: Stephen Rothwell
    Cc: Joel Becker
    Signed-off-by: Stephen Rothwell

    Tejun Heo
     

17 Jun, 2010

2 commits


16 Jun, 2010

2 commits

  • In commit 6b82021b9e91cd689fdffadbcdb9a42597bbe764, we increased
    our local alloc size and calculate how many megabytes we can
    get according to group size and volume size.
    But we also need to check the maximum number of bits a local alloc block
    bitmap can have. With bs=512, cs=32K and a local volume of 160G,
    it calculates 96MB while the maximum local alloc size is only
    76M. So the bitmap will overflow and corrupt the system truncate
    log file. See bug
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1262

    Signed-off-by: Tao Ma
    Acked-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Tao Ma
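
The 76M limit quoted in this commit can be reproduced with a short calculation. The sketch below rests on one loud assumption: `LA_BITMAP_OFFSET` (208 bytes of the block used before the bitmap starts) is a value chosen here so that the arithmetic matches the quoted figure, and it is not taken from the ocfs2 headers. Both function names are hypothetical.

```c
#include <stdint.h>

/* Assumed bytes of the local alloc block that precede the bitmap. */
#define LA_BITMAP_OFFSET 208u

/* Largest local alloc window, in megabytes, that the bitmap can describe. */
static uint32_t max_local_alloc_mb(uint32_t block_size, uint32_t cluster_size)
{
    /* one bitmap bit per cluster */
    uint64_t bits = (uint64_t)(block_size - LA_BITMAP_OFFSET) * 8;

    return (uint32_t)((bits * cluster_size) >> 20);
}

/* Clamp a requested local alloc size to what the bitmap can hold. */
static uint32_t clamp_local_alloc_mb(uint32_t wanted_mb, uint32_t block_size,
                                     uint32_t cluster_size)
{
    uint32_t max_mb = max_local_alloc_mb(block_size, cluster_size);

    return wanted_mb > max_mb ? max_mb : wanted_mb;
}
```

With bs=512 and cs=32K this gives (512 - 208) * 8 = 2432 bits, and 2432 * 32K = 76M, so a computed 96MB window has to be clamped to 76.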
     
  • We used to let orphan scan work run in the default work queue,
    but there is a corner case which makes the system deadlock.
    The scenario is like this:
    1. set the heartbeat threshold to 200. This gives us a
    good chance of having an orphan scan work before our quorum decision.
    2. mount node 1.
    3. after 1~2 minutes, mount node 2 (to make the bug easier
    to reproduce, add maxcpus=1 to the kernel command line).
    4. node 1 does orphan scan work.
    5. node 2 does orphan scan work.
    6. node 1 does orphan scan work. After this, node 1 holds the orphan scan
    lock while node 2 knows node 1 is the master.
    7. ifdown eth2 on node 2 (eth2 is the ocfs2 interconnect).

    Now when node 2 begins orphan scan, the system queue is blocked.

    The root cause is that both orphan scan work and quorum decision work
    use the system event work queue. Orphan scan has a chance of
    blocking the event work queue (in dlm_wait_for_node_death) so that there
    is no chance for quorum decision work to proceed.

    This patch resolves it by moving orphan scan work to ocfs2_wq.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma