23 Oct, 2015

3 commits

  • In f2fs_shrink_extent_tree we should stop the shrink flow once we have
    already shrunk enough nodes from the extent cache (a sketch of the
    intended loop appears below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
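
    A minimal sketch of the intended early-exit loop, assuming f2fs's
    rb-tree based extent cache; the function name and loop body are
    illustrative rather than the exact upstream code:

        /* Illustrative: free extent nodes until nr_shrink is reached. */
        static unsigned int shrink_tree_nodes(struct rb_root *root,
                                              unsigned int nr_shrink)
        {
                struct rb_node *node = rb_first(root);
                unsigned int freed = 0;

                while (node && freed < nr_shrink) { /* stop once enough */
                        struct extent_node *en =
                                rb_entry(node, struct extent_node, rb_node);

                        node = rb_next(node);
                        rb_erase(&en->rb_node, root);
                        kmem_cache_free(extent_node_slab, en);
                        freed++;
                }
                return freed;
        }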
     
  • In f2fs's ->symlink we keep a fixed invocation order between
    f2fs_add_link and page_symlink, since the node info must be
    initialized in f2fs_add_link before page_symlink can use it.

    But our error path did not release the metadata created before
    page_symlink, which leaves a corrupt symlink entry in the parent's
    dentry page. Fix this by calling f2fs_unlink in the error path to
    remove that link (see the sketch below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
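
    A hedged sketch of the error path described above; the exact call
    sites in f2fs_symlink differ, but the shape is: undo the earlier
    f2fs_add_link when writing the symlink body fails (symname/symlen
    stand in for the real arguments):

        err = f2fs_add_link(dentry, inode);   /* inits node info first */
        if (err)
                goto out;

        err = page_symlink(inode, symname, symlen);
        if (err)
                /* drop the dentry added above, so no corrupt symlink
                 * entry is left in the parent's dentry page */
                f2fs_unlink(dir, dentry);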
     
  • An atomic-write page can be GCed; after committing such a page, we
    should clear its GCed flag.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

14 Oct, 2015

2 commits

  • Once f2fs_gc is done, wait_ms is changed once more, so its tracepoint
    should be emitted after that change.

    Reported-by: He YunLei
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • Fix racing when accessing an encrypted page among different
    competitors.

    We use two different page caches (normally the inode's page cache for
    R/W and the meta inode's page cache for GC) to cache the same physical
    block belonging to an encrypted inode. Writeback from these two page
    caches should be mutually exclusive, but we did not handle the
    writeback state well, so there are potential races:

    a)
    kworker:
    - f2fs_write_data_pages
     - f2fs_write_data_page
      - do_write_data_page
       - write_data_page
        - f2fs_submit_page_mbio
          (page #1 in the inode's page cache is queued in the f2fs bio
          cache, ready to be written to a new blkaddr)
    f2fs_gc:
    - gc_data_segment
     - move_encrypted_block
      - pagecache_get_page
        (page #2 in the meta inode's page cache now caches the stale
        content of the physical block at that new blkaddr)
      - f2fs_submit_page_mbio
        (page #1 is submitted; later, page #2 with stale data will be
        submitted as well)

    b)
    f2fs_gc:
    - gc_data_segment
     - move_encrypted_block
      - f2fs_submit_page_mbio
        (page #1 in the meta inode's page cache is queued in the f2fs bio
        cache, ready to be written to a new blkaddr)
    user thread:
    - f2fs_write_begin
     - f2fs_submit_page_bio
       (we submit a request to the block layer to fill page #2 in the
       inode's page cache from the physical block at the new blkaddr, so
       we may read garbage data from it, since GC has not written back
       page #1 yet)

    This patch fixes the above potential races for encrypted inodes (see
    the sketch below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
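
    A hedged sketch of one way to close the race, following f2fs naming
    conventions (treat the exact signature as an assumption): before a
    competitor reuses a block address, wait out any in-flight writeback
    of the copy cached in the meta inode's mapping.

        void f2fs_wait_on_encrypted_page_writeback(struct f2fs_sb_info *sbi,
                                                   block_t blkaddr)
        {
                struct page *cpage;

                /* the GC copy, if any, lives at index == blkaddr */
                cpage = find_lock_page(META_MAPPING(sbi), blkaddr);
                if (cpage) {
                        f2fs_wait_on_page_writeback(cpage, DATA);
                        f2fs_put_page(cpage, 1);
                }
        }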
     

13 Oct, 2015

9 commits

  • After finishing building the free nid cache, we asynchronously
    readahead 4 more pages for the next reload; this readahead count is
    fixed.

    In some cases, such as SMR drives, reading a small fixed number of
    sectors on every RA trigger can be inefficient because of the high
    seek overhead, so we had better let the user configure this parameter
    from sysfs for their specific workload.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • When there is no free nid in the nid cache, all new node allocators
    stop and wait for free nids to be reloaded. The reload is synchronous,
    reading 4 NAT pages to rebuild the nid cache, which causes long
    latency.

    This patch readaheads more NAT pages with the READA request flag after
    reloading free nids. It helps performance when users allocate node ids
    intensively.

    Env: Sandisk 32G sd card
    time for i in `seq 1 60000`; { echo -n > /mnt/f2fs/$i; echo XXXXXX > /mnt/f2fs/$i;}

    Before:
    real 0m2.814s
    user 0m1.220s
    sys 0m1.536s

    After:
    real 0m2.711s
    user 0m1.136s
    sys 0m1.568s

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • Now, ra_meta_pages reads as many contiguous physical blocks as
    possible to improve the performance of subsequent reads. However,
    ra_meta_pages does synchronous readahead, submitting its bio with
    READ; since READ has high priority, it is unsuitable for merely
    preloading blocks whose use time is uncertain.

    This patch supports asynchronous readahead in ra_meta_pages by tagging
    the bio with the READA flag, allowing true preloading (see the sketch
    below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
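
    A minimal sketch of the flag selection, assuming ra_meta_pages gains
    a sync parameter; only the request flags matter here:

        struct f2fs_io_info fio = {
                .sbi = sbi,
                .type = META,
                /* blocking read for callers that need the data now,
                 * best-effort READA when merely preloading */
                .rw = sync ? (READ_SYNC | REQ_META | REQ_PRIO) : READA,
        };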
     
  • In the recovery and checkpoint flows, we temporarily grab pages in the
    meta inode's mapping to cache transient data. The data in these pages
    is not f2fs metadata, yet we still tag them with the REQ_META flag,
    and lower devices such as eMMC may apply optimizations to requests of
    that type. To avoid such misdirected optimization, we had better drop
    the flag for these temporary non-meta pages.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • This patch adds a tracepoint for f2fs_read_data_pages to trace when
    pages are read ahead by the VFS.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • For normal inodes, pages are allocated with __GFP_FS, which allows
    filesystem calls during memory reclaim and can therefore lead to a
    deadlock.

    So, this patch addresses the problem by introducing
    f2fs_grab_cache_page(.., bool for_write), which calls
    grab_cache_page_write_begin() with AOP_FLAG_NOFS on the write path
    (sketched in full below).

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
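
    The helper described above is small enough to sketch in full; this
    follows the commit's own description, with AOP_FLAG_NOFS keeping
    __GFP_FS out of write-path allocations:

        static inline struct page *
        f2fs_grab_cache_page(struct address_space *mapping, pgoff_t index,
                             bool for_write)
        {
                if (!for_write)
                        return grab_cache_page(mapping, index);
                /* NOFS allocation: no filesystem reentry from reclaim */
                return grab_cache_page_write_begin(mapping, index,
                                                   AOP_FLAG_NOFS);
        }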
     
  • f2fs_collapse_range and f2fs_insert_range change block addresses
    directly, but that can cause uncovered SSA updates. In that case we
    need to give up changing the block addresses and fall back to buffered
    writes to keep the filesystem consistent.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • The periodic checkpoint can resolve the previous issue.
    So, now we can use this again to improve the reported performance regression:

    https://lkml.org/lkml/2015/10/8/20

    This reverts commit 15bec0ff5a9ba6d203178fa8772259df6207942a.

    Jaegeuk Kim
     
  • This patch introduces F2FS_GOING_DOWN_METAFLUSH, which flushes meta
    pages such as SSA blocks and then blocks all writes. It can be used by
    power-failure tests (see the usage sketch below).

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
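
    A hedged userspace sketch of how a power-failure test might invoke
    the new mode; the ioctl number and flag value are assumptions here,
    so take the real ones from the kernel's f2fs headers:

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/types.h>

        /* assumed values; verify against the kernel's f2fs.h */
        #define F2FS_IOC_SHUTDOWN               _IOR('X', 125, __u32)
        #define F2FS_GOING_DOWN_METAFLUSH       0x3

        int main(int argc, char **argv)
        {
                __u32 flag = F2FS_GOING_DOWN_METAFLUSH;
                int fd;

                if (argc < 2)
                        return 1;
                fd = open(argv[1], O_RDONLY);   /* any file on the mount */
                if (fd < 0 || ioctl(fd, F2FS_IOC_SHUTDOWN, &flag) < 0) {
                        perror("F2FS_IOC_SHUTDOWN");
                        return 1;
                }
                close(fd);
                return 0;
        }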
     

10 Oct, 2015

21 commits

  • This patch tries to merge as many IOs as possible when the background
    flusher writes out dirty meta pages.

    [Before]

    ...
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124320, size = 4096
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124560, size = 32768
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 95720, size = 987136
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123928, size = 4096
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123944, size = 8192
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123968, size = 45056
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124064, size = 4096
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 97648, size = 1007616
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123776, size = 8192
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123800, size = 32768
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124624, size = 4096
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 99616, size = 921600
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123608, size = 4096
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123624, size = 77824
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123792, size = 4096
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123864, size = 32768
    ...

    [After]

    ...
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 92168, size = 892928
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 93912, size = 753664
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 95384, size = 716800
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 96784, size = 712704
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 104160, size = 364544
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 104872, size = 356352
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 105568, size = 278528
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 106112, size = 319488
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 106736, size = 258048
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 107240, size = 270336
    f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 107768, size = 180224
    ...

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch introduces a periodic checkpoint feature.
    Note that this does not enforce strict checkpoint trigger timing; it
    just aims to improve the user experience. The default interval is 60
    seconds (see the timing sketch below).

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
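
    A sketch of the timing check, assuming cp_interval/cp_expires fields
    on the superblock info (names mirror the f2fs code of this period but
    should be treated as assumptions):

        /* in background balancing: checkpoint once the interval elapses */
        if (time_after(jiffies, sbi->cp_expires))
                f2fs_sync_fs(sbi->sb, true);

        /* after each checkpoint: re-arm the deadline */
        sbi->cp_expires = round_jiffies_up(jiffies +
                                           HZ * sbi->cp_interval);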
     
  • This patch introduces a tracepoint to monitor background gc behaviors.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch introduces background_gc=sync, enabling synchronous
    cleaning in the background.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch introduces a new ioctl for users who want to trigger a
    checkpoint from userspace (a usage sketch appears below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
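
    A hedged userspace sketch; the ioctl encoding is an assumption, so
    check the kernel's f2fs.h before relying on it:

        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>

        /* assumed encoding; verify against the kernel's f2fs.h */
        #define F2FS_IOCTL_MAGIC                0xf5
        #define F2FS_IOC_WRITE_CHECKPOINT       _IO(F2FS_IOCTL_MAGIC, 7)

        int trigger_checkpoint(const char *path)
        {
                int fd = open(path, O_RDONLY);
                int ret;

                if (fd < 0)
                        return -1;
                ret = ioctl(fd, F2FS_IOC_WRITE_CHECKPOINT, 0);
                close(fd);
                return ret;
        }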
     
  • This patch drops batched gc triggered through the ioctl, since the
    user can easily control gc by designing a loop around the ->ioctl call
    (see the sketch below).

    We support synchronous gc by forcing FG_GC in f2fs_gc; with it, the
    user can be sure that all blocks gced in a round are persistent on the
    device by the time the ioctl returns.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
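
    A hedged sketch of the loop the message alludes to: userspace drives
    gc one round at a time, passing sync so each round's blocks are
    persistent before the ioctl returns (the ioctl encoding is an
    assumption):

        #include <sys/ioctl.h>
        #include <linux/types.h>

        /* assumed encoding; verify against the kernel's f2fs.h */
        #define F2FS_IOCTL_MAGIC            0xf5
        #define F2FS_IOC_GARBAGE_COLLECT    _IOW(F2FS_IOCTL_MAGIC, 6, __u32)

        /* run up to 'rounds' synchronous gc passes on an open f2fs fd;
         * stop early once the kernel reports an error or no more work */
        static int gc_rounds(int fd, int rounds)
        {
                __u32 sync = 1;

                while (rounds-- > 0)
                        if (ioctl(fd, F2FS_IOC_GARBAGE_COLLECT, &sync) < 0)
                                return -1;
                return 0;
        }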
     
  • When searching for a victim during gc, if there are no dirty segments
    in the filesystem we still take the time to scan the whole dirty
    segment map. That is unnecessary; it is better to skip the search in
    this condition.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • When doing gc, we search for a victim in the dirty map starting from
    the position of the last victim, only resetting the search position
    once we touch the end of the dirty map, and then we search the whole
    dirty map again. So sometimes we search the range [victim, last]
    twice, which is redundant; this patch avoids that.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • Our extent cache hit statistics increase for as long as the filesystem
    is mounted, and we use atomic_t for the stat variables, so they can
    easily overflow when the extent cache is queried frequently on a
    long-running filesystem.

    To avoid that, this patch uses atomic64_t for the hit stat variables
    (see the sketch below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
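
    The conversion is mechanical; a sketch with an illustrative counter
    name:

        static atomic64_t total_hit_ext = ATOMIC64_INIT(0);

        static void on_extent_cache_hit(void)
        {
                /* atomic_t wraps after ~2^31 events; atomic64_t won't */
                atomic64_inc(&total_hit_ext);
        }

        static long long read_hit_stat(void)
        {
                return atomic64_read(&total_hit_ext);
        }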
     
  • This patch introduces f2fs_kvmalloc to avoid -ENOMEM during mount
    (a sketch appears below).

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
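
    The helper is most likely the classic kmalloc-then-vmalloc fallback;
    a sketch consistent with the description above:

        static inline void *f2fs_kvmalloc(size_t size, gfp_t flags)
        {
                void *ret;

                /* try a physically contiguous allocation first, quietly */
                ret = kmalloc(size, flags | __GFP_NOWARN);
                if (!ret)
                        /* fall back to vmalloc space so a fragmented
                         * system can still mount */
                        ret = __vmalloc(size, flags, PAGE_KERNEL);
                return ret;
        }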
     
  • If we do not call get_victim first, we cannot get a new victim for the
    retry path.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch fixes maintaining the right count of sections freed by
    garbage collection when a foreground gc is triggered.

    Besides, when a foreground gc runs on the currently selected section
    and we fail to gc one segment, it is better to abandon gcing the
    remaining segments in that section: we will select the next victim for
    foreground gc anyway, so gcing the remaining segments of the previous
    section only adds overhead and long latency for the caller.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • This patch fixes updating ctime and mtime correctly when truncating to
    a larger size in ->setattr (a sketch of the fix appears below).

    The bug is reported by xfstest generic/313 as below:

    generic/313 2s ... - output mismatch (see ./results/generic/313.out.bad)
    --- tests/generic/313.out 2015-08-04 15:28:53.430798882 +0800
    +++ results/generic/313.out.bad 2015-09-28 17:04:27.294278016 +0800
    @@ -1,2 +1,4 @@
    QA output created by 313
    Silence is golden
    +ctime not updated after truncate up
    +mtime not updated after truncate up
    ...
    (Run 'diff -u tests/generic/313.out tests/generic/313.out.bad' to see the entire diff)
    Ran: generic/313
    Failures: generic/313
    Failed 1 of 1 tests

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
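
    A sketch of the fix in f2fs_setattr's truncate-up branch, using the
    kernel time API of this era; the surrounding logic is assumed:

        if (attr->ia_size > i_size_read(inode)) {
                /* growing the file: nothing to trim, but POSIX still
                 * requires ctime/mtime to be updated */
                truncate_setsize(inode, attr->ia_size);
                inode->i_mtime = inode->i_ctime = CURRENT_TIME;
        }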
     
  • Previously, we skipped dentry block writes when wbc was SYNC_NONE,
    there was no memory pressure, and the number of dirty pages was pretty
    small.

    But we never skipped normal data writes, so the skipping had little
    impact on overall performance. Moreover, by skipping some writes,
    kworker falls into an infinite loop trying to write the blocks when
    many dir inodes have only one dentry block.

    So, this patch removes the write skipping.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • Protecting the recovery flow with cp_rwsem is not needed, since we
    already prevent any checkpoint from being triggered by holding
    cp_mutex beforehand.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • In update_sit_info we use div_u64 to handle a 'u64 divided by u64'
    case, but div_u64 only takes a 32-bit divisor, so the u64 divisor we
    pass is truncated, producing wrong values in the f2fs debug info, as
    below:

    BDF: 464, avg. vblocks: 23509
    (BDF should never exceed 100)

    So change to div64_u64, which handles this case correctly (signatures
    sketched below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
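
    The two helpers differ only in divisor width, which is the whole bug;
    from <linux/math64.h> (the usage example is illustrative):

        /*
         *   u64 div_u64(u64 dividend, u32 divisor);    divisor is 32-bit
         *   u64 div64_u64(u64 dividend, u64 divisor);  full 64-bit divisor
         */
        static u64 avg_vblocks(u64 total_vblocks, u64 ndirty_segs)
        {
                /* both operands are u64, so div_u64() would silently
                 * truncate the divisor; use div64_u64() instead */
                return div64_u64(total_vblocks, ndirty_segs);
        }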
     
  • This patch adds a new helper __try_update_largest_extent for cleanup.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • This fixes error handling in recover_inline_data for calls to
    functions that signal failure either via an error code or via a
    boolean false. When such a failure arises, recover_inline_data now
    returns false to its caller immediately, since we cannot continue
    after a failure of either truncate_inline_inode or truncate_blocks.

    Signed-off-by: Nicholas Krause
    Signed-off-by: Jaegeuk Kim

    Nicholas Krause
     
  • Switching the extent_cache option dynamically at remount may cause a
    consistency issue between the extent cache and dnode pages. This patch
    fixes it to avoid that condition.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • We introduced F2FS_GET_BLOCK_READ in commit e2b4e2bc8865 ("f2fs: fix
    incorrect mapping for bmap") but forgot to use this flag in the right
    place; fix it.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • Here is an oops reported while testing generic/019 of xfstests:

    ------------[ cut here ]------------
    kernel BUG at /home/yuchao/git/f2fs-dev/segment.c:882!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: zram lz4_compress lz4_decompress f2fs(O) ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4
    nf_def
    CPU: 2 PID: 25441 Comm: fio Tainted: G O 4.3.0-rc1+ #6
    Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.61 05/16/2013
    task: ffff8803f4e85580 ti: ffff8803fd61c000 task.ti: ffff8803fd61c000
    RIP: 0010:[] [] new_curseg+0x321/0x330 [f2fs]
    RSP: 0018:ffff8803fd61f918 EFLAGS: 00010246
    RAX: 00000000000007ed RBX: 0000000000000224 RCX: 000000000000001f
    RDX: 0000000000000800 RSI: ffffffffffffffff RDI: ffff8803f56f4300
    RBP: ffff8803fd61f978 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000024 R11: ffff8800d23bbd78 R12: ffff8800d0ef0000
    R13: 0000000000000224 R14: 0000000000000000 R15: 0000000000000001
    FS: 00007f827ff85700(0000) GS:ffff88041ea80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffff600000 CR3: 00000003fef17000 CR4: 00000000001406e0
    Stack:
    000007ea00000002 0000000100000001 ffff8803f6456248 000007ed0000002b
    0000000000000224 ffff880404d1aa20 ffff8803fd61f9c8 ffff8800d0ef0000
    ffff8803f6456248 0000000000000001 00000000ffffffff ffffffffa078f358
    Call Trace:
    [] allocate_segment_by_default+0x1a7/0x1f0 [f2fs]
    [] allocate_data_block+0x17c/0x360 [f2fs]
    [] __allocate_data_block+0x131/0x1d0 [f2fs]
    [] f2fs_direct_IO+0x4b5/0x580 [f2fs]
    [] generic_file_direct_write+0xae/0x160
    [] __generic_file_write_iter+0xd5/0x1f0
    [] generic_file_write_iter+0xf7/0x200
    [] ? apparmor_file_permission+0x18/0x20
    [] ? f2fs_fallocate+0x1190/0x1190 [f2fs]
    [] f2fs_file_write_iter+0x46/0x90 [f2fs]
    [] aio_run_iocb+0x1ee/0x290
    [] ? mutex_lock+0x1e/0x50
    [] ? aio_read_events+0x207/0x2b0
    [] do_io_submit+0x373/0x630
    [] ? SyS_io_getevents+0x56/0xb0
    [] SyS_io_submit+0x10/0x20
    [] entry_SYSCALL_64_fastpath+0x12/0x6a
    Code: 45 c8 48 8b 78 10 e8 9f 23 bf e0 41 8b 8c 24 cc 03 00 00 89 c7 31 d2 89 c6 89 d8 29 df f7 f1 29 d1 39 cf 0f 83 be fd ff ff eb
    RIP [] new_curseg+0x321/0x330 [f2fs]
    RSP
    ---[ end trace 2e577d7f711ddb86 ]---

    The reason is that in the generic/019 test we inject an artificial IO
    error into the block layer through debugfs; after that, prefree
    segments are no longer freed, because we always skip doing gc or
    checkpoint once an IO error has occurred.

    Meanwhile, fio with the aio engine generates a large number of direct
    IOs, which keep allocating space from the free segments until we run
    out of them, eventually panicking in new_curseg because no free
    segment can be found.

    So, this patch changes direct_IO to return -EIO in this condition (a
    sketch appears below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
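
    A hedged sketch of the guard: f2fs_cp_error() is the existing helper
    that reports a checkpoint-stopping error, and placing the check at
    the head of the direct-IO path is an assumption about the exact fix:

        /* refuse direct IO once a critical error has stopped gc and
         * checkpointing: further allocation would only exhaust the
         * remaining free segments */
        if (unlikely(f2fs_cp_error(sbi)))
                return -EIO;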