Eric Lee / smarc-fsl-linux-kernel

05 Apr, 2016

1 commit

8f6fd83c6 rhashtable: accept GFP flags in rhashtable_walk_init ... Browse Code »

In certain cases, the 802.11 mesh pathtable code wants to
iterate over all of the entries in the forwarding table from
the receive path, which is inside an RCU read-side critical
section. Enable walks inside atomic sections by allowing
GFP_ATOMIC allocations for the walker state.

Change all existing callsites to pass in GFP_KERNEL.

Acked-by: Thomas Graf
Signed-off-by: Bob Copeland
[also adjust gfs2/glock.c and rhashtable tests]
Signed-off-by: Johannes Berg

Bob Copeland
2016-04-05 16:56:32 +0800

02 Apr, 2016

2 commits

82d2a348b Merge branch 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"This has a few fixes Dave Sterba had queued up. These are all pretty
small, but since they were tested I decided against waiting for more"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: transaction_kthread() is not freezable
btrfs: cleaner_kthread() doesn't need explicit freeze
btrfs: do not write corrupted metadata blocks to disk
btrfs: csum_tree_block: return proper errno value

Linus Torvalds
2016-04-02 07:08:34 +0800
22fed3977 Merge tag 'for-linus' of git://github.com/martinbrandenburg/linux ... Browse Code »

Pull OrangeFS fixes from Martin Brandenburg:
"Two bugfixes for OrangeFS.

One is a reference counting bug and the other is a typo in client
minimum version"

* tag 'for-linus' of git://github.com/martinbrandenburg/linux:
orangefs: minimum userspace version is 2.9.3
orangefs: don't put readdir slot twice

Linus Torvalds
2016-04-02 06:17:32 +0800

01 Apr, 2016

2 commits

878dfd321 orangefs: minimum userspace version is 2.9.3 ... Browse Code »

Version 2.9.4 isn't even released yet.

Signed-off-by: Martin Brandenburg

Martin Brandenburg
2016-04-01 00:06:00 +0800
641bb3246 orangefs: don't put readdir slot twice ... Browse Code »

This was quite an oversight. After a readdir, the module could not be
unloaded, the number of slots is wrong, and memory near the slot bitmap
is possibly corrupt. Oops.

Signed-off-by: Martin Brandenburg

Martin Brandenburg
2016-04-01 00:06:00 +0800

31 Mar, 2016

2 commits

e9dcfaff0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs fix from Al Viro.

Automount handling was broken by commit e3c13928086f ("namei: massage
lookup_slow() to be usable by lookup_one_len_unlocked()") moving the
test for negative dentry too early.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix the braino in "namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()"

Linus Torvalds
2016-03-31 20:19:39 +0800
7500c38ac fix the braino in "namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()" ... Browse Code »

We should try to trigger automount *before* bailing out on negative dentry.

Reported-by: Jun'ichi Nomura
Reported-by: Jun'ichi Nomura
Reported-by: Arend van Spriel
Tested-by: Arend van Spriel
Tested-by: Jun'ichi Nomura
Signed-off-by: Al Viro

Al Viro
2016-03-31 12:23:05 +0800

29 Mar, 2016

1 commit

82c7d823c dlm: config: Fix ENOMEM failures in make_cluster() ... Browse Code »

Commit 1ae1602de0 "configfs: switch ->default groups to a linked list"
left the NULL gps pointer behind after removing the kcalloc() call which
made it non-NULL. It also left the !gps check in place so make_cluster()
now fails with ENOMEM. Remove the remaining uses of the gps variable to
fix that.

Reviewed-by: Bob Peterson
Reviewed-by: Andreas Gruenbacher
Signed-off-by: Andrew Price
Signed-off-by: David Teigland

Andrew Price
2016-03-29 23:28:08 +0800

27 Mar, 2016

3 commits

d5a38f6e4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client ... Browse Code »

Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
cleanup on the mon_client and osd_client code from Ilya, scattered
writeback support for CephFS and a pile of bug fixes from Zheng, and a
few random cleanups and fixes from others"

[ I already decided not to pull this because of it having been rebased
recently, but ended up changing my mind after all. Next time I'll
really hold people to it. Oh well. - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
libceph: use KMEM_CACHE macro
ceph: use kmem_cache_zalloc
rbd: use KMEM_CACHE macro
ceph: use lookup request to revalidate dentry
ceph: kill ceph_get_dentry_parent_inode()
ceph: fix security xattr deadlock
ceph: don't request vxattrs from MDS
ceph: fix mounting same fs multiple times
ceph: remove unnecessary NULL check
ceph: avoid updating directory inode's i_size accidentally
ceph: fix race during filling readdir cache
libceph: use sizeof_footer() more
ceph: kill ceph_empty_snapc
ceph: fix a wrong comparison
ceph: replace CURRENT_TIME by current_fs_time()
ceph: scattered page writeback
libceph: add helper that duplicates last extent operation
libceph: enable large, variable-sized OSD requests
libceph: osdc->req_mempool should be backed by a slab pool
libceph: make r_request msg_size calculation clearer
...

Linus Torvalds
2016-03-27 06:53:16 +0800
698f415cf Merge tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux ... Browse Code »

Pull orangefs filesystem from Mike Marshall.

This finally merges the long-pending orangefs filesystem, which has been
much cleaned up with input from Al Viro over the last six months. From
the documentation file:

"OrangeFS is an LGPL userspace scale-out parallel storage system. It
is ideal for large storage problems faced by HPC, BigData, Streaming
Video, Genomics, Bioinformatics.

Orangefs, originally called PVFS, was first developed in 1993 by Walt
Ligon and Eric Blumer as a parallel file system for Parallel Virtual
Machine (PVM) as part of a NASA grant to study the I/O patterns of
parallel programs.

Orangefs features include:

- Distributes file data among multiple file servers
- Supports simultaneous access by multiple clients
- Stores file data and metadata on servers using local file system
and access methods
- Userspace implementation is easy to install and maintain
- Direct MPI support
- Stateless"

see Documentation/filesystems/orangefs.txt for more in-depth details.

* tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (174 commits)
orangefs: fix orangefs_superblock locking
orangefs: fix do_readv_writev() handling of error halfway through
orangefs: have ->kill_sb() evict the VFS side of things first
orangefs: sanitize ->llseek()
orangefs-bufmap.h: trim unused junk
orangefs: saner calling conventions for getting a slot
orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer
orangefs: get rid of readdir_handle_s
ornagefs: ensure that truncate has an up to date inode size
orangefs: move code which sets i_link to orangefs_inode_getattr
orangefs: remove needless wrapper around GFP_KERNEL
orangefs: remove wrapper around mutex_lock(&inode->i_mutex)
orangefs: refactor inode type or link_target change detection
orangefs: use new getattr for revalidate and remove old getattr
orangefs: use new getattr in inode getattr and permission
orangefs: use new orangefs_inode_getattr to get size in write and llseek
orangefs: use new orangefs_inode_getattr to create new inodes
orangefs: rename orangefs_inode_getattr to orangefs_inode_old_getattr
orangefs: remove inode->i_lock wrapper
orangefs: put register_chrdev immediately before register_filesystem
...

Linus Torvalds
2016-03-27 03:59:04 +0800
02fc59a0d f2fs/crypto: fix xts_tweak initialization ... Browse Code »

Commit 0b81d07790726 ("fs crypto: move per-file encryption from f2fs
tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and
renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_"
to just "FS" for preprocessor symbols).

Because of the symbol renaming, it's a bit hard to see it as a file
move: use

git show -M30 0b81d07790726

to lower the rename detection to just 30% similarity and make git show
the files as renamed (the header file won't be shown as a rename even
then - since all it contains is symbol definitions, it looks almost
completely different).

Even with the renames showing as renames, the diffs are not all that
easy to read, since so much is just the renames. But Eric Biggers
noticed that it's not just all renames: the initialization of the
xts_tweak had been broken too, using the inode number rather than the
page offset.

That's not right - it makes the xfs_tweak the same for all pages of each
inode. It _might_ make sense to make the xfs_tweak contain both the
offset _and_ the inode number, but not just the inode number.

Reported-by: Eric Biggers
Cc: Jaegeuk Kim
Signed-off-by: Linus Torvalds

Linus Torvalds
2016-03-27 01:13:05 +0800

26 Mar, 2016

29 commits

45996492e orangefs: fix orangefs_superblock locking ... Browse Code »

* switch orangefs_remount() to taking ORANGEFS_SB(sb) instead of sb
* remove from the list _before_ orangefs_unmount() - request_mutex
in the latter will make sure that nothing observed in the loop in
ORANGEFS_DEV_REMOUNT_ALL handling will get freed until the end
of loop
* on removal, keep the forward pointer and zero the back one. That
way we can drop and regain the spinlock in the loop body (again,
ORANGEFS_DEV_REMOUNT_ALL one) and still be able to get to the
rest of the list.

Signed-off-by: Al Viro
Signed-off-by: Mike Marshall

Al Viro
2016-03-26 19:22:00 +0800
6d4c1a30b orangefs: fix do_readv_writev() handling of error halfway through ... Browse Code »

Error should only be returned if nothing had been read/written.
Otherwise we need to report a short read/write instead.

Signed-off-by: Al Viro
Signed-off-by: Mike Marshall

Al Viro
2016-03-26 10:30:54 +0800
524b1d309 orangefs: have ->kill_sb() evict the VFS side of things first ... Browse Code »

Signed-off-by: Al Viro
Signed-off-by: Mike Marshall

Al Viro
2016-03-26 10:30:54 +0800
177f8fc49 orangefs: sanitize ->llseek() ... Browse Code »

a) open files can't have NULL inodes
b) it's SEEK_END, not ORANGEFS_SEEK_END; no need to get cute.
c) make_bad_inode() on lseek()?

Signed-off-by: Al Viro
Signed-off-by: Mike Marshall

Al Viro
2016-03-26 10:30:54 +0800
7df240d77 orangefs-bufmap.h: trim unused junk ... Browse Code »

Signed-off-by: Al Viro
Signed-off-by: Mike Marshall

Al Viro
2016-03-26 10:30:54 +0800
b8a99a8f9 orangefs: saner calling conventions for getting a slot ... Browse Code »

just have it return the slot number or -E... - the caller checks
the sign anyway

Signed-off-by: Al Viro
Signed-off-by: Mike Marshall

Al Viro
2016-03-26 10:30:54 +0800
bf6bf606e orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer ... Browse Code »

it's always __orangefs_bufmap

Signed-off-by: Al Viro
Signed-off-by: Mike Marshall

Al Viro
2016-03-26 10:30:54 +0800
9f5e2f7f1 orangefs: get rid of readdir_handle_s ... Browse Code »

no point, really - we couldn't keep those across the calls of
getdents(); it would be too easy to DoS, having all slots exhausted.

Signed-off-by: Al Viro
Signed-off-by: Mike Marshall

Al Viro
2016-03-26 10:30:54 +0800
102c2595a ocfs2: extend enough credits for freeing one truncate record while replaying truncate records ... Browse Code »

Now function ocfs2_replay_truncate_records() first modifies tl_used,
then calls ocfs2_extend_trans() to extend transactions for gd and alloc
inode used for freeing clusters. jbd2_journal_restart() may be called
and it may happen that tl_used in truncate log is decreased but the
clusters are not freed, which means these clusters are lost. So we
should avoid extending transactions in these two operations.

Signed-off-by: joyce.xue
Reviewed-by: Mark Fasheh
Acked-by: Joseph Qi
Cc: Joel Becker
Cc: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xue jiufei
2016-03-26 07:37:42 +0800
172159898 ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edg… ... Browse Code »

…e_lengths() before to avoid inconsistency between inode and et

I found that jbd2_journal_restart() is called in some places without
keeping things consistently before. However, jbd2_journal_restart() may
commit the handle's transaction and restart another one. If the first
transaction is committed successfully while another not, it may cause
filesystem inconsistency or read only. This is an effort to fix this
kind of problems.

This patch (of 3):

The following functions will be called while truncating an extent:
ocfs2_remove_btree_range
-> ocfs2_start_trans
-> ocfs2_remove_extent
-> ocfs2_truncate_rec
-> ocfs2_extend_rotate_transaction
-> jbd2_journal_restart if jbd2_journal_extend fail
-> ocfs2_rotate_tree_left
-> ocfs2_remove_rightmost_path
-> ocfs2_extend_rotate_transaction
-> ocfs2_unlink_subtree
-> ocfs2_update_edge_lengths
-> ocfs2_extend_trans
-> jbd2_journal_restart if jbd2_journal_extend fail
-> ocfs2_et_update_clusters
-> ocfs2_commit_trans

jbd2_journal_restart() may be called and it may happened that the buffers
dirtied in ocfs2_truncate_rec() are committed while buffers dirtied in
ocfs2_et_update_clusters() are not, the total clusters on extent tree and
i_clusters in ocfs2_dinode is inconsistency. So the clusters got from
ocfs2_dinode is incorrect, and it also cause read-only problem when call
ocfs2_commit_truncate() with the error message: "Inode %llu has empty
extent block at %llu".

We should extend enough credits for function ocfs2_remove_rightmost_path
and ocfs2_update_edge_lengths to avoid this inconsistency.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Acked-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Xue jiufei
2016-03-26 07:37:42 +0800
e5054c9ae ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert ... Browse Code »

We have found a bug when two nodes doing umount one after another.

1) Node 1 migrate a lockres that has 3 locks in grant queue such as
N2(PR)N3(NL)N4(PR) to N2. After migration, lvb of the lock
N3(NL) and N4(PR) are empty on node 2 because migration target do not
copy lvb to these two lock.

2) Node 3 want to convert to PR, it can be granted in
__dlmconvert_master(), and the order of these locks is unchanged. The
lvb of the lock N3(PR) on node 2 is copyed from lockres in function
dlm_update_lvb() while the lvb of lock N4(PR) is still empty.

3) Node 2 want to leave domain, it will migrate this lockres to node 3.
Then node 2 will trigger the BUG in dlm_prepare_lvb_for_migration()
when adding the lock N4(PR) to mres with the following message because
the lvb of mres is already copied from lock N3(PR), but the lvb of lock
N4(PR) is empty.

"Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u"

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: xuejiufei
Acked-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

xuejiufei
2016-03-26 07:37:42 +0800
584dca344 ocfs2: solve a problem of crossing the boundary in updating backups ... Browse Code »

In update_backups() there exists a problem of crossing the boundary as
follows:

we assume that lun will be resized to 1TB(cluster_size is 32kb), it will
include 0~33554431 cluster, in update_backups func, it will backup super
block in location of 1TB which is the 33554432th cluster, so the
phenomenon of crossing the boundary happens.

Signed-off-by: Yiwen Jiang
Reviewed-by: Joseph Qi
Cc: Xue jiufei
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

jiangyiwen
2016-03-26 07:37:42 +0800
35ddf78e4 ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local ... Browse Code »

This patch fixes a deadlock, as follows:

Node 1 Node 2 Node 3
1)volume a and b are only mount vol a only mount vol b
mounted

2) start to mount b start to mount a

3) check hb of Node 3 check hb of Node 2
in vol a, qs_holds++ in vol b, qs_holds++

4) -------------------- all nodes' network down --------------------

5) progress of mount b the same situation as
failed, and then call Node 2
ocfs2_dismount_volume.
but the process is hung,
since there is a work
in ocfs2_wq cannot beo
completed. This work is
about vol a, because
ocfs2_wq is global wq.
BTW, this work which is
scheduled in ocfs2_wq is
ocfs2_orphan_scan_work,
and the context in this work
needs to take inode lock
of orphan_dir, because
lockres owner are Node 1 and
all nodes' nework has been down
at the same time, so it can't
get the inode lock.

6) Why can't this node be fenced
when network disconnected?
Because the process of
mount is hung what caused qs_holds
is not equal 0.

Because all works in the ocfs2_wq are relative to the super block.

The solution is to change the ocfs2_wq from global to local. In other
words, move it into struct ocfs2_super.

Signed-off-by: Yiwen Jiang
Reviewed-by: Joseph Qi
Cc: Xue jiufei
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Cc: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

jiangyiwen
2016-03-26 07:37:42 +0800
be12b299a ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list ... Browse Code »

When master handles convert request, it queues ast first and then
returns status. This may happen that the ast is sent before the request
status because the above two messages are sent by two threads. And
right after the ast is sent, if master down, it may trigger BUG in
dlm_move_lockres_to_recovery_list in the requested node because ast
handler moves it to grant list without clear lock->convert_pending. So
remove BUG_ON statement and check if the ast is processed in
dlmconvert_remote.

Signed-off-by: Joseph Qi
Reported-by: Yiwen Jiang
Cc: Junxiao Bi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Tariq Saeed
Cc: Junxiao Bi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joseph Qi
2016-03-26 07:37:42 +0800
ac7cf246d ocfs2/dlm: fix race between convert and recovery ... Browse Code »

There is a race window between dlmconvert_remote and
dlm_move_lockres_to_recovery_list, which will cause a lock with
OCFS2_LOCK_BUSY in grant list, thus system hangs.

dlmconvert_remote
{
spin_lock(&res->spinlock);
list_move_tail(&lock->list, &res->converting);
lock->convert_pending = 1;
spin_unlock(&res->spinlock);

status = dlm_send_remote_convert_request();
>>>>>> race window, master has queued ast and return DLM_NORMAL,
and then down before sending ast.
this node detects master down and calls
dlm_move_lockres_to_recovery_list, which will revert the
lock to grant list.
Then OCFS2_LOCK_BUSY won't be cleared as new master won't
send ast any more because it thinks already be authorized.

spin_lock(&res->spinlock);
lock->convert_pending = 0;
if (status != DLM_NORMAL)
dlm_revert_pending_convert(res, lock);
spin_unlock(&res->spinlock);
}

In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set
(res is still in recovering) or res master changed (new master has
finished recovery), reset the status to DLM_RECOVERING, then it will
retry convert.

Signed-off-by: Joseph Qi
Reported-by: Yiwen Jiang
Reviewed-by: Junxiao Bi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Tariq Saeed
Cc: Junxiao Bi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joseph Qi
2016-03-26 07:37:42 +0800
28888681b ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write() ... Browse Code »

The code should call ocfs2_free_alloc_context() to free meta_ac &
data_ac before calling ocfs2_run_deallocs(). Because
ocfs2_run_deallocs() will acquire the system inode's i_mutex hold by
meta_ac. So try to release the lock before ocfs2_run_deallocs().

Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
Signed-off-by: Ryan Ding
Acked-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
ce170828e ocfs2: fix disk file size and memory file size mismatch ... Browse Code »

When doing append direct write in an already allocated cluster, and fast
path in ocfs2_dio_get_block() is triggered, function
ocfs2_dio_end_io_write() will be skipped as there is no context
allocated.

As a result, the disk file size will not be changed as it should be.
The solution is to skip fast path when we are about to change file size.

Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
Signed-off-by: Ryan Ding
Acked-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
a86a72a4a ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_write ... Browse Code »

Take ip_alloc_sem to prevent concurrent access to extent tree, which may
cause the extent tree in an unstable state.

Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
e63890f38 ocfs2: fix ip_unaligned_aio deadlock with dio work queue ... Browse Code »

In the current implementation of unaligned aio+dio, lock order behave as
follow:

in user process context:
-> call io_submit()
-> get i_mutex
get ip_unaligned_aio
-> submit direct io to block device
-> release i_mutex
-> io_submit() return

in dio work queue context(the work queue is created in __blockdev_direct_IO):
-> release ip_unaligned_aio
get i_mutex
-> clear unwritten flag & change i_size
-> release i_mutex

There is a limitation to the thread number of dio work queue. 256 at
default. If all 256 thread are in the above 'window2' stage, and there
is a user process in the 'window1' stage, the system will became
deadlock. Since the user process hold i_mutex to wait ip_unaligned_aio
lock, while there is a direct bio hold ip_unaligned_aio mutex who is
waiting for a dio work queue thread to be schedule. But all the dio
work queue thread is waiting for i_mutex lock in 'window2'.

This case only happened in a test which send a large number(more than
256) of aio at one io_submit() call.

My design is to remove ip_unaligned_aio lock. Change it to a sync io
instead. Just like ip_unaligned_aio lock, serialize the unaligned aio
dio.

[akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
f1f973ffc ocfs2: code clean up for direct io ... Browse Code »

Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write:
* remove append dio check: it will be checked in ocfs2_direct_IO()
* remove file hole check: file hole is supported for now
* remove inline data check: it will be checked in ocfs2_direct_IO()
* remove the full_coherence check when append dio: we will get the
inode_lock in ocfs2_dio_get_block, there is no need to fall back to
buffer io to ensure the coherence semantics.

Now the drop dio procedure is gone. :)

[akpm@linux-foundation.org: remove unused label]
Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
c15471f79 ocfs2: fix sparse file & data ordering issue in direct io ... Browse Code »

There are mainly three issues in the direct io code path after commit
24c40b329e03 ("ocfs2: implement ocfs2_direct_IO_write"):

* Does not support sparse file.
* Does not support data ordering. eg: when write to a file hole, it
will alloc extent first. If system crashed before io finished, data
will corrupt.
* Potential risk when doing aio+dio. The -EIOCBQUEUED return value is
likely to be ignored by ocfs2_direct_IO_write().

To resolve above problems, re-design direct io code with following ideas:
* Use buffer io to fill in holes. And this will make better
performance also.
* Clear unwritten after direct write finished. So we can make sure
meta data changes after data write to disk. (Unwritten extent is
invisible to user, from user's view, meta data is not changed when
allocate an unwritten extent.)
* Clear ocfs2_direct_IO_write(). Do all ending work in end_io.

This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4
test cases of ltp.

For performance improvement, see following test result:
ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/.
The original way:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s

After this patch:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s

Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
4506cfb6f ocfs2: record UNWRITTEN extents when populate write desc ... Browse Code »

To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.

There is still one issue in the direct write procedure.

phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk

When there are 2 direct write A(0~3KB),B(4~7KB) writing to the same
cluster 0~7KB (cluster size 8KB). Write request A arrive phase 2 first,
it will zero the region (4~7KB). Before request A enter to phase 3,
request B arrive phase 2, it will zero region (0~3KB). This is just like
request B steps request A.

To resolve this issue, we should let request B knows this cluster is already
under zero, to prevent it from steps the previous write request.

This patch will add function ocfs2_unwritten_check() to do this job. It
will record all clusters that are under direct write(it will be recorded
in the 'ip_unwritten_list' member of inode info), and prevent the later
direct write writing to the same cluster to do the zero work again.

Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
2de6a3c73 ocfs2: return the physical address in ocfs2_write_cluster ... Browse Code »

To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.

Direct io needs to get the physical address from write_begin, to map the
user page. This patch is to change the arg 'phys' of
ocfs2_write_cluster to a pointer, so it can be retrieved to write_begin.
And we can retrieve it to the direct io procedure.

Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
46e625565 ocfs2: do not change i_size in write_end for direct io ... Browse Code »

To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.

Append direct io do not change i_size in get block phase. It only move
to orphan when starting write. After data is written to disk, it will
delete itself from orphan and update i_size. So skip i_size change
section in write_begin for direct io.

And when there is no extents alloc, no meta data changes needed for
direct io (since write_begin start trans for 2 reason: alloc extents &
change i_size. Now none of them needed). So we can skip start trans
procedure.

Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
65c4db8c8 ocfs2: test target page before change it ... Browse Code »

To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.

Direct io data will not appear in buffer. The w_target_page member will
not be filled by direct io. So avoid to use it when it's NULL. Unlinke
buffer io and mmap, direct io will call write_begin with more than 1
page a time. So the target_index is not sufficient to describe the
actual data. change it to a range start at target_index, end in
end_index.

Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
b46637d59 ocfs2: use c_new to indicate newly allocated extents ... Browse Code »

To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.

There is a problem in ocfs2's direct io implement: if system crashed
after extents allocated, and before data return, we will get a extent
with dirty data on disk. This problem violate the journal=order
semantics, which means meta changes take effect after data written to
disk. To resolve this issue, direct write can use the UNWRITTEN flag to
describe a extent during direct data writeback. The direct write
procedure should act in the following order:

phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk

This patch is to change the 'c_unwritten' member of
ocfs2_write_cluster_desc to 'c_clear_unwritten'. Means whether to clear
the unwritten flag. It do not care if a extent is allocated or not.
And use 'c_new' to specify a newly allocated extent. So the direct io
procedure can use c_clear_unwritten to control the UNWRITTEN bit on
extent.

Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
c1ad1e3ca ocfs2: add ocfs2_write_type_t type to identify the caller of write ... Browse Code »

Patchset: fix ocfs2 direct io code patch to support sparse file and data
ordering semantics

The idea is to use buffer io(more precisely use the interface
ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work
beyond block size. And clear UNWRITTEN flag until direct io data has
been written to disk, which can prevent data corruption when system
crashed during direct write.

And we will also archive a better performance: eg. dd direct write new
file with block size 4KB: before this patchset:
2.5 MB/s
after this patchset:
66.4 MB/s

This patch (of 8):

To support direct io in ocfs2_write_begin_nolock &
ocfs2_write_end_nolock.

Remove unused args filp & flags. Add new arg type. The type is one of
buffer/direct/mmap. Indicate 3 way to perform write. buffer/mmap type
has implemented. direct type will be implemented later.

Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Ding
2016-03-26 07:37:42 +0800
9e13f1f9d ocfs2: o2hb: fix double free bug ... Browse Code »

This is a regression issue and caused the following kernel panic when do
ocfs2 multiple test.

BUG: unable to handle kernel paging request at 00000002000800c0
IP: [] kmem_cache_alloc+0x78/0x160
PGD 7bbe5067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront
CPU: 2 PID: 4044 Comm: mpirun Not tainted 4.5.0-rc5-next-20160225 #1
Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
task: ffff88007a521a80 ti: ffff88007aed0000 task.ti: ffff88007aed0000
RIP: 0010:[] [] kmem_cache_alloc+0x78/0x160
RSP: 0018:ffff88007aed3a48 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001991
RDX: 0000000000001990 RSI: 00000000024000c0 RDI: 000000000001b330
RBP: ffff88007aed3a98 R08: ffff88007d29b330 R09: 00000002000800c0
R10: 0000000c51376d87 R11: ffff8800792cac38 R12: ffff88007cc30f00
R13: 00000000024000c0 R14: ffffffff811b053f R15: ffff88007aed3ce7
FS: 0000000000000000(0000) GS:ffff88007d280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000002000800c0 CR3: 000000007aeb2000 CR4: 00000000000406e0
Call Trace:
__d_alloc+0x2f/0x1a0
d_alloc+0x17/0x80
lookup_dcache+0x8a/0xc0
path_openat+0x3c3/0x1210
do_filp_open+0x80/0xe0
do_sys_open+0x110/0x200
SyS_open+0x19/0x20
do_syscall_64+0x72/0x230
entry_SYSCALL64_slow_path+0x25/0x25
Code: 05 e6 77 e7 7e 4d 8b 08 49 8b 40 10 4d 85 c9 0f 84 dd 00 00 00 48 85 c0 0f 84 d4 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 3c 01 75 b6 49 63
RIP kmem_cache_alloc+0x78/0x160
CR2: 00000002000800c0
---[ end trace 823969e602e4aaac ]---

Fixes: a4a1dfa4bb8b("ocfs2/cluster: fix memory leak in o2hb_region_release")
Signed-off-by: Junxiao Bi
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-03-26 07:37:42 +0800
99ec26977 ceph: use kmem_cache_zalloc ... Browse Code »

Use kmem_cache_zalloc() instead of kmem_cache_alloc() with flag GFP_ZERO.

Signed-off-by: Geliang Tang
Signed-off-by: Ilya Dryomov

Geliang Tang
2016-03-26 01:51:56 +0800