31 Jan, 2014
1 commit
-
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_vec series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:

- Various smaller blk-mq fixes from different folks. Nothing major
  here, just minor fixes and cleanups.

- Fix for a memory leak in the error path in the block ioctl code
  from Christian Engelmayer.

- Header export fix from CaiZhiyong.

- Finally the immutable biovec changes from Kent Overstreet. This
  enables some nice future work on making arbitrarily sized bios
  possible, and splitting more efficient. Related fixes to immutable
  bio_vecs:

  - dm-cache immutable fixup from Mike Snitzer.
  - btrfs immutable fixup from Muthu Kumar.

- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
29 Jan, 2014
1 commit
-
Pull vfs updates from Al Viro:
"Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
assorted cleanups and fixes all over the place...

There will be another pile later this week"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
__dentry_path() fixes
vfs: Remove second variable named error in __dentry_path
vfs: Is mounted should be testing mnt_ns for NULL or error.
Fix race when checking i_size on direct i/o read
hfsplus: remove can_set_xattr
nfsd: use get_acl and ->set_acl
fs: remove generic_acl
nfs: use generic posix ACL infrastructure for v3 Posix ACLs
gfs2: use generic posix ACL infrastructure
jfs: use generic posix ACL infrastructure
xfs: use generic posix ACL infrastructure
reiserfs: use generic posix ACL infrastructure
ocfs2: use generic posix ACL infrastructure
jffs2: use generic posix ACL infrastructure
hfsplus: use generic posix ACL infrastructure
f2fs: use generic posix ACL infrastructure
ext2/3/4: use generic posix ACL infrastructure
btrfs: use generic posix ACL infrastructure
fs: make posix_acl_create more useful
fs: make posix_acl_chmod more useful
...
26 Jan, 2014
3 commits
-
f2fs has some weird mode bit handling, so still using the old
chmod code for now.

Signed-off-by: Christoph Hellwig
Reviewed-by: Jaegeuk Kim
Signed-off-by: Al Viro
-
Rename the current posix_acl_create to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().

Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Al Viro
-
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.

Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Al Viro
23 Jan, 2014
1 commit
-
If a node page is truncated, we'd better drop the page in the node_inode's page
cache for a better memory footprint.

Signed-off-by: Jaegeuk Kim
22 Jan, 2014
5 commits
-
This patch adds NODE_MAPPING, which is similar to META_MAPPING introduced by
Gu Zheng.

Cc: Gu Zheng
Signed-off-by: Jaegeuk Kim
-
Since orphan_blocks can be as large as 504, it is neither safe
nor rigorous to store such a large array on the kernel stack,
as Dan Carpenter said.
In fact, grab_meta_page has locked the page in the page cache,
and we can use find_get_page() to fetch the page safely
downstream, so we can remove the page array directly.

Reported-by: Dan Carpenter
Signed-off-by: Gu Zheng
Signed-off-by: Jaegeuk Kim
-
Introduce helper function META_MAPPING() to get the cached meta blocks'
address space.

Signed-off-by: Gu Zheng
Signed-off-by: Jaegeuk Kim
-
This patch moves a function in f2fs_delete_entry for code readability.
Signed-off-by: Jaegeuk Kim
-
If a dentry page is updated, we should call mark_inode_dirty to add the inode
into the dirty list, so that its dentry pages are flushed to the disk.
Otherwise, the inode can be evicted without flush.

Signed-off-by: Jaegeuk Kim
20 Jan, 2014
1 commit
-
Fixed a variety of trivial checkpatch warnings. The only delta should
be some minor formatting on log strings that were split / too long.

Signed-off-by: Chris Fries
Signed-off-by: Jaegeuk Kim
16 Jan, 2014
2 commits
-
When doing sync_meta_pages with META_FLUSH during checkpoint, we override rw
using WRITE_FLUSH_FUA. At this time, we should also set
REQ_META|REQ_PRIO.

Signed-off-by: Changman Lee
Signed-off-by: Jaegeuk Kim
-
This patch should resolve the following bug.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.13.0-rc5.f2fs+ #6 Not tainted
---------------------------------------------------------
kswapd0/41 just changed the state of lock:
(&sbi->gc_mutex){+.+.-.}, at: [] f2fs_balance_fs+0xae/0xd0 [f2fs]
but this lock took another, RECLAIM_FS-READ-unsafe lock in the past:
(&sbi->cp_rwsem){++++.?}

and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&sbi->gc_mutex --> &sbi->cp_mutex --> &sbi->cp_rwsem

Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&sbi->cp_rwsem);
                               local_irq_disable();
                               lock(&sbi->gc_mutex);
                               lock(&sbi->cp_mutex);
  <Interrupt>
    lock(&sbi->gc_mutex);

*** DEADLOCK ***
This bug is due to the f2fs_balance_fs call in f2fs_write_data_page.
If f2fs_write_data_page is triggered by wbc->for_reclaim via kswapd, it should
not call f2fs_balance_fs which tries to get a mutex grabbed by original syscall
flow.

Signed-off-by: Jaegeuk Kim
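
The rule described in this commit message can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct wbc_model`, `balance_fs_model()` and `write_data_page_model()` are hypothetical stand-ins, not the f2fs functions. The only detail taken from the message is that the balance step must be skipped when writeback was triggered for reclaim.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the for_reclaim bit of the writeback control. */
struct wbc_model {
    bool for_reclaim;   /* true when writeback was triggered by kswapd */
};

/* Counts how often the modelled balance step ran. */
int balance_calls;

/* Stand-in for f2fs_balance_fs(); the real function may take gc_mutex. */
void balance_fs_model(void)
{
    balance_calls++;
}

/*
 * Model of the write path after the fix: the page is written either way,
 * but the balance step is skipped on the reclaim path so that kswapd never
 * waits for a mutex held by the original syscall flow.
 */
int write_data_page_model(const struct wbc_model *wbc)
{
    /* ... the page itself would be written here ... */
    if (!wbc->for_reclaim)
        balance_fs_model();
    return 0;
}
```

Calling `write_data_page_model()` with `for_reclaim` set leaves `balance_calls` unchanged, which is the behaviour the patch description asks for.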
14 Jan, 2014
5 commits
-
Support for f2fs-tools/tools/f2stat to monitor
/sys/kernel/debug/f2fs/status

Signed-off-by: Changman Lee
Signed-off-by: Jaegeuk Kim
-
With the 2 previous changes, all the long time operations are moved out
of the protection region, so here we can use spinlock rather than mutex
(orphan_inode_mutex) for lower overhead.

Signed-off-by: Gu Zheng
Signed-off-by: Jaegeuk Kim
-
Move the allocation of a new orphan node out of the lock protection region.
Signed-off-by: Gu Zheng
Signed-off-by: Jaegeuk Kim
-
Move grabbing the orphan block pages out of the protection region, and grab all
the orphan block pages ahead.

Signed-off-by: Gu Zheng
Reviewed-by: Chao Yu
[Jaegeuk Kim: remove unnecessary code pointed by Chao Yu]
Signed-off-by: Jaegeuk Kim
-
The "bool sync" parameter is never referenced in f2fs_wait_on_page_writeback.
We should remove this parameter.

Signed-off-by: Yuan Zhong
Signed-off-by: Jaegeuk Kim
08 Jan, 2014
2 commits
-
Previously during SSR and GC, the maximum number of retries to find a victim
segment was hard-coded by MAX_VICTIM_SEARCH, 4096 by default.

This number affects IO locality when SSR mode is activated, which
results in performance fluctuation on some low-end devices.

If max_victim_search = 4, the victim will be searched like below.
("D" represents a dirty segment, and "*" indicates a selected victim segment.)

D1 D2 D3 D4 D5 D6 D7 D8 D9
[ * ]
[ * ]
[ * ]
[ ....]

This patch adds a sysfs entry to control the number dynamically through:
/sys/fs/f2fs/$dev/max_victim_search

Signed-off-by: Jaegeuk Kim
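
The effect of a small search limit can be illustrated with a user-space sketch. Everything here is an assumption made for illustration: `pick_victim()`, the per-segment `cost` array and the policy of resuming from the position where the previous search stopped are hypothetical and are not taken from the f2fs sources. The sketch only shows how a bound on the number of examined segments keeps each search inside a short window of dirty segments.

```c
#include <stddef.h>

/*
 * Scan at most max_search dirty segments, starting at *last_pos, and return
 * the index of the cheapest one seen. *last_pos is advanced so that the next
 * call resumes where this one stopped. Returns -1 if there is nothing to scan.
 */
int pick_victim(const unsigned int *cost, size_t nr_dirty,
                size_t *last_pos, size_t max_search)
{
    int victim = -1;
    size_t scanned;

    if (nr_dirty == 0 || max_search == 0)
        return -1;

    for (scanned = 0; scanned < max_search && scanned < nr_dirty; scanned++) {
        size_t idx = (*last_pos + scanned) % nr_dirty;

        /* Keep the cheapest segment seen inside the current window. */
        if (victim < 0 || cost[idx] < cost[victim])
            victim = (int)idx;
    }

    /* Remember where to resume, wrapping around the dirty list. */
    *last_pos = (*last_pos + scanned) % nr_dirty;
    return victim;
}
```

With nine dirty segments and a limit of 4, successive calls examine segments 0 to 3, then 4 to 7, and so on, so each victim stays close to the previous one. A limit as large as the dirty list degenerates into a full scan.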
-
When considering a bunch of data writes with very frequent fsync calls, we
can expect the following performance regression.

N: Node IO, D: Data IO, IO scheduler: cfq
Issue    pending IOs
         D1 D2 D3 D4
D1       D2 D3 D4 N1
D2       D3 D4 N1 N2
N1       D3 D4 N2 D1
--> N1 can be selected by cfq because of the same priority of N and D.
Then D3 and D4 would be delayed, resulting in performance degradation.

So, when processing the fsync call, it is better to give higher priority to data IOs
than node IOs by assigning WRITE and WRITE_SYNC respectively.
This patch improves the random write performance with frequent fsync calls by up
to 10%.

Signed-off-by: Jaegeuk Kim
06 Jan, 2014
9 commits
-
Here is a case which could read inline page data not from the first page.
1. write inline data
2. lseek to offset 4096
3. read 4096 bytes from offset 4096
(read_inline_data reads the inline data page into a non-first page,
and the VFS has previously added this page to the page cache)
4. ftruncate offset 8192
5. read 4096 bytes from offset 4096
(we meet this updated page with inline data in cache)

So we should leave this page with inited data and uptodate flag
for this case.

Change log from v1:
o fix a deadlock bug

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim
-
Change log from v1:
o reduce unneeded memset in __f2fs_convert_inline_data

From 58796be2bd2becbe8d52305210fb2a64e7dd80b6 Mon Sep 17 00:00:00 2001
From: Chao Yu
Date: Mon, 30 Dec 2013 09:21:33 +0800
Subject: [PATCH] f2fs: avoid to left uninitialized data in page when read
inline data

We left uninitialized data in the tail of the page when we read an inline data
page. So let's initialize the remaining part of the page, excluding the inline data region.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim
-
The truncate_partial_nodes puts pages incorrectly in the following two cases.
Note that the value for the argument 'depth' can only be 2 or 3.
Please see truncate_inode_blocks() and truncate_partial_nodes().

1) An error occurs in the first 'for' loop
When an error occurs with depth = 2, pages[0] is invalid, so this page doesn't
need to be put and there is no problem. However, when depth is 3, it doesn't put
the pages correctly where pages[0] is valid and pages[1] is invalid.
In this case, depth is set to 2 (refer to the statement depth = i + 1), and then
'goto fail'.
In label 'fail', for (i = depth - 3; i >= 0; i--) cannot meet the condition
because i = -1, so pages[0] can't be put.

2) An error occurs in the second 'for' loop
Now we've got pages[0] with depth = 2, or we've got pages[0] and pages[1]
with depth = 3. When an error is detected, we need 'goto fail' to put such
pages.
When depth is 2, in label 'fail', for (i = depth - 3; i >= 0; i--) can't
meet the condition because i = -1, so pages[0] can't be put.
When depth is 3, in label 'fail', for (i = depth - 3; i >= 0; i--) can
only put pages[0]; pages[1] also can't be put.

Note that 'depth' has been changed before the first 'goto fail' (refer to the statement
depth = i + 1), so passing this modified 'depth' to the tracepoint,
trace_f2fs_truncate_partial_nodes, is also incorrect.

Signed-off-by: Shifei Ge
[Jaegeuk Kim: modify the description and fix one bug]
Signed-off-by: Jaegeuk Kim
-
The get_dnode_of_data nullifies inode and node page when an error occurs.
There are two cases that pass the inode page into get_dnode_of_data().
1. make_empty_dir()
    -> get_new_data_page()
      -> f2fs_reserve_block(ipage)
        -> get_dnode_of_data()

2. f2fs_convert_inline_data()
    -> __f2fs_convert_inline_data()
      -> f2fs_reserve_block(ipage)
        -> get_dnode_of_data()

This patch adds correct error handling code for when get_dnode_of_data() returns
an error.

At first, f2fs_reserve_block() calls f2fs_put_dnode() whenever reserve_new_block
returns an error.
So, the rule of f2fs_reserve_block() is to nullify the inode page when there is any
error internally.

Finally, the two callers of f2fs_reserve_block() should call f2fs_put_dnode()
appropriately if they get an error after a successful f2fs_reserve_block().

Signed-off-by: Jaegeuk Kim
-
This patch adds an inline_data recovery routine with the following policy.
[prev.] [next] of inline_data flag
   o       o  -> recover inline_data
   o       x  -> remove inline_data, and then recover data blocks
   x       o  -> remove inline_data, and then recover inline_data
   x       x  -> recover data blocks

Signed-off-by: Jaegeuk Kim
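
The four-row policy can be written as a small decision function. This is a sketch only: the enum names and `inline_recovery_action()` are hypothetical and simply mirror the rows of the table in this message, with "o" read as "flag set" and "x" as "flag clear".

```c
#include <stdbool.h>

/* One value per row of the policy table (names are illustrative). */
enum recover_action {
    RECOVER_INLINE_DATA,          /* o o */
    REMOVE_INLINE_THEN_BLOCKS,    /* o x */
    REMOVE_INLINE_THEN_INLINE,    /* x o */
    RECOVER_DATA_BLOCKS,          /* x x */
};

/*
 * prev_inline: inline_data flag of the previous state ([prev.] column).
 * next_inline: inline_data flag of the state being recovered ([next] column).
 */
enum recover_action inline_recovery_action(bool prev_inline, bool next_inline)
{
    if (prev_inline)
        return next_inline ? RECOVER_INLINE_DATA
                           : REMOVE_INLINE_THEN_BLOCKS;

    return next_inline ? REMOVE_INLINE_THEN_INLINE
                       : RECOVER_DATA_BLOCKS;
}
```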
-
This patch adds the number of inline_data files into the status information.
Note that the number is reset whenever the filesystem is newly mounted.

Signed-off-by: Jaegeuk Kim
-
Change log from v1:
o handle NULL pointer of grab_cache_page_write_begin() pointed by Chao Yu.

This patch refactors f2fs_convert_inline_data to check a couple of conditions
internally for deciding whether it needs to convert inline_data or not.

So, the new f2fs_convert_inline_data initially checks:
1) f2fs_has_inline_data(), and
2) the data size to be changed.

If the inode has inline_data but the size to fill is less than MAX_INLINE_DATA,
then we don't need to convert the inline_data with data allocation.

Signed-off-by: Jaegeuk Kim
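
The two checks listed above can be sketched as a single predicate. This is a minimal model, not the kernel function: `needs_inline_conversion()` is a hypothetical name, `MODEL_MAX_INLINE_DATA` is an assumed value chosen to be roughly the 3.4K mentioned elsewhere in this log, and treating a size exactly equal to the limit as requiring conversion is also an assumption, since the message only covers the "less than" case.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed limit for illustration; the real value depends on the inode layout. */
#define MODEL_MAX_INLINE_DATA 3488

/*
 * Returns true when the inline data must be moved to a regular data block
 * before the file grows to to_size bytes.
 */
bool needs_inline_conversion(bool has_inline_data, size_t to_size)
{
    /* Check 1: a file without inline data has nothing to convert. */
    if (!has_inline_data)
        return false;

    /* Check 2: the new size still fits in the inline area. */
    if (to_size < MODEL_MAX_INLINE_DATA)
        return false;

    return true;
}
```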
-
In f2fs_write_begin(), if f2fs_convert_inline_data() returns an error like
-ENOSPC, f2fs should call f2fs_put_page().
Otherwise, the page remains locked, resulting in the following bug.

[] sleep_on_page+0xe/0x20
[] __lock_page+0x67/0x70
[] truncate_inode_pages_range+0x368/0x5d0
[] truncate_inode_pages+0x15/0x20
[] truncate_pagecache+0x4b/0x70
[] truncate_setsize+0x12/0x20
[] f2fs_setattr+0x72/0x270 [f2fs]
[] notify_change+0x213/0x400
[] do_truncate+0x66/0xa0
[] vfs_truncate+0x191/0x1b0
[] do_sys_truncate+0x5c/0xa0
[] SyS_truncate+0xe/0x10
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

Signed-off-by: Jaegeuk Kim
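
The error path described in this message follows a common pattern: a page that was grabbed locked must be unlocked and released before an error is returned. The sketch below models that pattern in user space. `struct page_model`, `put_page_model()`, `convert_model()` and `write_begin_model()` are hypothetical names, and the lock and reference count are plain fields rather than real page state.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy page: only the two properties that matter for this error path. */
struct page_model {
    bool locked;
    int refcount;
};

/* Stand-in for a put helper that optionally unlocks before dropping the reference. */
void put_page_model(struct page_model *page, bool unlock)
{
    if (unlock)
        page->locked = false;
    page->refcount--;
}

/* Stand-in for the inline data conversion; fails with -ENOSPC on request. */
int convert_model(bool fail)
{
    return fail ? -ENOSPC : 0;
}

/*
 * Model of the fixed flow: the page is grabbed locked, and if the
 * conversion fails it is unlocked and released before returning.
 */
int write_begin_model(struct page_model *page, bool convert_fails)
{
    int err;

    page->locked = true;    /* models grabbing a locked page */
    page->refcount++;

    err = convert_model(convert_fails);
    if (err) {
        put_page_model(page, true);   /* without this the page stays locked */
        return err;
    }
    return 0;               /* success: the caller owns the locked page */
}
```

Leaving out the `put_page_model()` call in the failure branch reproduces the situation in the trace: a later truncate would wait forever on a page that nobody unlocks.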
-
In the punch_hole(), let's convert inline_data all the time for simplicity and
to avoid potential deadlock conditions.
It is pretty much not a big deal to do this.

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim
27 Dec, 2013
1 commit
-
This patch moves the inline_data check ahead of the f2fs_lock_op() call
in truncate_blocks(), since getting the lock is unnecessary.

Signed-off-by: Jaegeuk Kim
26 Dec, 2013
7 commits
-
Hook inline data read/write, truncate, fallocate, setattr, etc.
Files need to meet the following 2 requirements to be inlined:
1) file size is not greater than MAX_INLINE_DATA;
2) file doesn't pre-allocate data blocks by fallocate().

FI_INLINE_DATA will not be set while creating a new regular inode because
most of the files are bigger than ~3.4K. Set FI_INLINE_DATA only when
data is submitted to the block layer, rather than setting it while creating a new
inode; this also avoids converting data from inline to normal data block
and vice versa.

While writing inline data to the inode block, the first data block should be
released if the file has a block indexed by i_addr[0].

On the other hand, when a file operation is applied to a file with inline
data, we need to test if this file can remain inline after this
operation; otherwise it should be converted into a normal file by reserving
a new data block, copying the inline data to this new block and clearing the
FI_INLINE_DATA flag. Because reserving a new data block here will make use
of i_addr[0], if we save inline data in i_addr[0..872], then the first
4 bytes would be overwritten. This problem can be avoided simply by
not using i_addr[0] for inline data.

Signed-off-by: Huajun Li
Signed-off-by: Haicheng Li
Signed-off-by: Weihong Xu
Signed-off-by: Jaegeuk Kim
-
Functions to implement inline data read/write, and move inline data to
normal data block when file size exceeds inline data limitation.

Signed-off-by: Huajun Li
Signed-off-by: Haicheng Li
Signed-off-by: Weihong Xu
Signed-off-by: Jaegeuk Kim
-
Previously, we needed to calculate the max orphan num whenever we tried to acquire an
orphan inode, but it is a stable value once the super block has been inited. So
converting it to a field of f2fs_sb_info and using it directly when needed seems
a better choice.

Signed-off-by: Gu Zheng
Signed-off-by: Jaegeuk Kim
-
The f2fs supports 4KB block size. If a user requests a direct write (dwrite) with less than 4KB of data,
it allocates a new 4KB data block.
However, f2fs doesn't add zero data into the untouched data area inside the
newly allocated data block.

This incurs an error during the xfstest #263 test as follows.
263 12s ... [failed, exit status 1] - output mismatch (see 263.out.bad)
--- 263.out 2013-03-09 03:37:15.043967603 +0900
+++ 263.out.bad 2013-12-27 04:20:39.230203114 +0900
@@ -1,3 +1,976 @@
QA output created by 263
fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
+fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z
+truncating to largest ever: 0x12a00
+truncating to largest ever: 0x75400
+fallocating to largest ever: 0x79cbf
...
(Run 'diff -u 263.out 263.out.bad' to see the entire diff)
Ran: 263
Failures: 263
Failed 1 of 1 tests

It turns out that, when the test tries to write 2KB data with dio, the new dio
path allocates a 4KB data block without filling zero data inside the remaining 2KB
area. Finally, the output file contains garbage data for that region.

Signed-off-by: Jaegeuk Kim
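
The failure mode is easy to reproduce in a user-space model: if only the written bytes are copied into a fresh block, the rest of the block keeps whatever was in memory before. The sketch below zeroes the block first. It is an illustration only; `fill_new_block()` and `MODEL_BLOCK_SIZE` are hypothetical, and the actual fix in f2fs may zero the block at a different point in the dio path.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 4096   /* 4KB block, as stated in the message */

/*
 * Place len bytes of data at offset inside a newly allocated block and make
 * sure every untouched byte reads back as zero. Returns 0 on success and -1
 * if the write does not fit inside one block.
 */
int fill_new_block(unsigned char *block, const unsigned char *data,
                   size_t offset, size_t len)
{
    if (offset > MODEL_BLOCK_SIZE || len > MODEL_BLOCK_SIZE - offset)
        return -1;

    /* Zero the whole block so stale memory never reaches the file. */
    memset(block, 0, MODEL_BLOCK_SIZE);

    /* Then copy in the bytes that were actually written. */
    memcpy(block + offset, data, len);
    return 0;
}
```

For a 2KB write at offset 0, the first 2048 bytes hold the data and the remaining 2048 bytes are zero, which is what the fsx comparison in the test output expects.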
-
When get_dnode_of_data() in get_data_block() returns a successful dnode, we
should put the dnode.
But, previously, if its data block address is equal to NEW_ADDR, we didn't do
that, resulting in a deadlock condition.
So, this patch splits original error conditions with this case, and then calls
f2fs_put_dnode before finishing the function.

Signed-off-by: Jaegeuk Kim
-
This patch introduces F2FS_INODE that returns struct f2fs_inode * from the inode
page.
By using this macro, we can remove unnecessary casting codes like below.

struct f2fs_inode *ri = &F2FS_NODE(inode_page)->i;
-> struct f2fs_inode *ri = F2FS_INODE(inode_page);

Reviewed-by: Chao Yu
Signed-off-by: Jaegeuk Kim
-
In the current flow, we will get a NULL return value from f2fs_find_entry in
recover_dentry when name.len is bigger than F2FS_NAME_LEN, and then we
still add this inode into its dir entry.
To avoid this situation, we must check the filename length before we use it.

Another point is that we could remove the code that checks the filename length
in f2fs_find_entry, because f2fs_lookup will have been called previously to ensure the
validity of the filename length.

V2:
o add WARN_ON() as Jaegeuk Kim suggested.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim
23 Dec, 2013
2 commits
-
When we rename a dir to a new name which did not previously exist,
we will set pino of parent inode with ino of child inode in f2fs_set_link.
This destroys the consistency of pino and should be fixed.

Thanks for previous work of Shu Tan.
Signed-off-by: Shu Tan
Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim
-
Update several comments:
1. use f2fs_{un}lock_op instead of mutex_{un}lock_op.
2. update comment of get_data_block().
3. update description of node offset.

Signed-off-by: Chao Yu
Signed-off-by: Jaegeuk Kim