01 Oct, 2016

2 commits

  • This patch introduces a spinlock to protect updates of the ckpt_flags
    field in struct f2fs_checkpoint, avoiding incorrect updates under race
    conditions (see the sketch below).

    Signed-off-by: Chao Yu
    [Jaegeuk Kim: add __is_set_ckpt_flags, as with __set_ckpt_flags]
    Signed-off-by: Jaegeuk Kim

    Chao Yu
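
    A minimal user-space sketch of the locking pattern this patch applies,
    with pthreads standing in for the kernel spinlock; the helper names
    mirror the f2fs ones but everything here is illustrative:

        #include <pthread.h>
        #include <stdio.h>

        /* toy stand-ins for struct f2fs_checkpoint's flag word */
        static unsigned int ckpt_flags;
        static pthread_spinlock_t cp_lock;

        #define CP_UMOUNT_FLAG 0x00000001  /* illustrative flag value */

        /* the read-modify-write of ckpt_flags must happen under the
         * lock, or two racing updaters can lose each other's bits */
        static void set_ckpt_flags(unsigned int f)
        {
            pthread_spin_lock(&cp_lock);
            ckpt_flags |= f;
            pthread_spin_unlock(&cp_lock);
        }

        static int is_set_ckpt_flags(unsigned int f)
        {
            pthread_spin_lock(&cp_lock);
            int ret = (ckpt_flags & f) != 0;
            pthread_spin_unlock(&cp_lock);
            return ret;
        }

        int main(void)
        {
            pthread_spin_init(&cp_lock, PTHREAD_PROCESS_PRIVATE);
            set_ckpt_flags(CP_UMOUNT_FLAG);
            printf("flag set: %d\n", is_set_ckpt_flags(CP_UMOUNT_FLAG));
            pthread_spin_destroy(&cp_lock);
            return 0;
        }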
     
  • Previously, we used cp_version only to detect recoverable dnodes.
    To avoid hitting the same garbage cp_version, we needed to truncate the
    next dnode during checkpoint, resulting in an additional discard or
    data write. If we can distinguish recoverable dnodes by a crc in
    addition to cp_version, we can remove this overhead (see the sketch
    below).

    There is a backward-compatibility concern, since this changes the
    node_footer layout. So, this patch introduces a new checkpoint flag,
    CP_CRC_RECOVERY_FLAG, to detect the new layout; the new layout is
    activated only when this flag is set.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
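
    A rough user-space sketch of the recovery check described above; the
    footer layout and the crc routine are simplified stand-ins, not the
    on-disk f2fs format:

        #include <stdint.h>
        #include <stdio.h>

        /* simplified node footer: under the new layout flagged by
         * CP_CRC_RECOVERY_FLAG, the high bits carry a crc of the
         * checkpoint version rather than the version alone */
        struct toy_footer {
            uint64_t cp_ver;  /* low 32 bits: cp version; high: crc */
        };

        /* toy bitwise crc32, standing in for the kernel's crc */
        static uint32_t toy_crc32(uint64_t v)
        {
            uint32_t crc = 0xffffffffu;
            for (int i = 0; i < 64; i++) {
                uint32_t bit = ((uint32_t)(v >> i) ^ crc) & 1;
                crc = (crc >> 1) ^ (bit ? 0xedb88320u : 0);
            }
            return ~crc;
        }

        static uint64_t pack(uint64_t cp_ver)
        {
            return (cp_ver & 0xffffffffu) |
                   ((uint64_t)toy_crc32(cp_ver) << 32);
        }

        static void mark_fsync(struct toy_footer *f, uint64_t cp_ver)
        {
            f->cp_ver = pack(cp_ver);
        }

        /* a stale dnode with a garbage version is very unlikely to
         * also carry a matching crc, so no truncation is needed */
        static int is_recoverable(const struct toy_footer *f,
                                  uint64_t cur_cp_ver)
        {
            return f->cp_ver == pack(cur_cp_ver);
        }

        int main(void)
        {
            struct toy_footer f;
            mark_fsync(&f, 42);
            printf("recoverable at cp 42: %d\n", is_recoverable(&f, 42));
            printf("recoverable at cp 43: %d\n", is_recoverable(&f, 43));
            return 0;
        }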
     

07 Jul, 2016

1 commit


08 Jun, 2016

2 commits


23 Feb, 2016

4 commits

  • In write_begin, if the storage supports stable pages, we don't need to
    wait for writeback before updating the page's contents.
    This patch switches to wait_for_stable_page instead of
    wait_on_page_writeback (sketched below).

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
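
    In kernel terms the change is a one-call swap in the write_begin path;
    a minimal before/after sketch:

        /* before: always wait for writeback to finish */
        wait_on_page_writeback(page);

        /* after: wait_for_stable_page() skips the wait unless the
         * backing device actually requires stable pages under
         * writeback, so write_begin can proceed immediately on
         * other storage */
        wait_for_stable_page(page);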
     
  • The scenario is:
    1. create full node blocks
    2. flush node blocks
    3. write inline_data for all the node blocks again
    4. flush node blocks redundantly

    So, this patch tries to flush inline_data when flushing node blocks
    (sketched below).

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
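
    A sketch of the idea in the node writeback path; the helper names
    follow the patch (is_inline_node, flush_inline_data), but the
    surrounding context is heavily simplified:

        /* while scanning dirty node pages: if an inode node still
         * carries inline data, write that data out first, so the
         * same node blocks are not flushed twice for nothing */
        if (is_inline_node(page)) {
            clear_inline_node(page);
            unlock_page(page);
            flush_inline_data(sbi, ino_of_node(page));
            continue;  /* rescan this page from the lock step */
        }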
     
  • This patch exports a new sysfs entry 'dirty_nat_ratio' to control the
    threshold of dirty nat entries. If the current ratio exceeds the
    configured threshold, a checkpoint is triggered in f2fs_balance_fs_bg
    to flush the dirty nats.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • When testing f2fs with xfstests, generic/251 gets stuck for a long
    time. The case uses the steps below to obtain freshly released space
    in the device, in order to prepare for the following fstrim test.

    1. rm -rf /mnt/dir
    2. mkdir /mnt/dir/
    3. cp -axT `pwd`/ /mnt/dir/
    4. goto 1

    During the preparation step, all nat entries are cached in the nat
    cache; most of them are dirty entries with invalid blkaddr, which
    means the nodes related to these entries have been truncated, and they
    could be reused after the dirty entries have been checkpointed.

    However, no checkpoint is triggered, so nid allocators (e.g. mkdir,
    creat) run into the long journey of iterating all NAT pages, looking
    for free nids in alloc_nid->build_free_nids.

    Here, in f2fs_balance_fs_bg we give the checkpoint another chance to
    flush nat entries so they can be reused in the free nid cache when the
    dirty entry count exceeds 10% of the max count (sketched below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
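
    A standalone sketch of the threshold test; the 10% default here is
    what dirty_nat_ratio later makes tunable, and all the names are
    illustrative:

        #include <stdio.h>

        /* illustrative counters for the nat cache */
        struct nat_counters {
            unsigned int dirty_nat_cnt;   /* dirty entries pending cp */
            unsigned int max_nid;         /* upper bound on nat entries */
            unsigned int dirty_nat_ratio; /* percent threshold, def. 10 */
        };

        /* mirrors the excess-dirty-nats idea: trigger a background
         * checkpoint once dirty entries pass the configured ratio */
        static int excess_dirty_nats(const struct nat_counters *c)
        {
            return c->dirty_nat_cnt >=
                   (unsigned long long)c->max_nid *
                   c->dirty_nat_ratio / 100;
        }

        int main(void)
        {
            struct nat_counters c = { .max_nid = 100000,
                                      .dirty_nat_ratio = 10 };

            c.dirty_nat_cnt = 9000;
            printf("cp needed: %d\n", excess_dirty_nats(&c)); /* 0 */
            c.dirty_nat_cnt = 10000;
            printf("cp needed: %d\n", excess_dirty_nats(&c)); /* 1 */
            return 0;
        }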
     

09 Jan, 2016

1 commit


05 Dec, 2015

1 commit


13 Oct, 2015

2 commits

  • After finishing building the free nid cache, we try to read ahead 4
    more pages asynchronously for the next reload; the count of readahead
    nid pages is fixed.

    In some cases, such as SMR drives, reading a small fixed number of
    sectors each time we trigger readahead can be inefficient, since we
    face high seek overhead. So we'd better let the user configure this
    parameter from sysfs for specific workloads (see the sketch below).

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
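
    In the kernel this boils down to replacing the fixed readahead count
    with the sysfs-backed field; a simplified sketch of the call site
    (ra_meta_pages and META_NAT are existing f2fs facilities, the exact
    context is abridged):

        /* after a scan, read ahead NAT pages for the next reload;
         * nm_i->ra_nid_pages replaces the old fixed count and is
         * settable via /sys/fs/f2fs/<disk>/ra_nid_pages */
        ra_meta_pages(sbi, NAT_BLOCK_OFFSET(nm_i->next_scan_nid),
                      nm_i->ra_nid_pages, META_NAT, false);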
     
  • The periodic checkpoint can resolve the previous issue.
    So, now we can use this again to improve the reported performance regression:

    https://lkml.org/lkml/2015/10/8/20

    This reverts commit 15bec0ff5a9ba6d203178fa8772259df6207942a.

    Jaegeuk Kim
     

10 Oct, 2015

1 commit

  • Previously, we skipped dentry block writes when wbc is SYNC_NONE,
    there is no memory pressure, and the number of dirty pages is pretty
    small.

    But we didn't skip normal data writes, so the skipping buys little in
    overall performance.
    Moreover, by skipping some data writes, kworker falls into an infinite
    loop trying to write blocks when many dir inodes have only one dentry
    block.

    So, this patch removes the skipping of data writes.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

29 May, 2015

1 commit


04 Mar, 2015

1 commit

  • Introduce infrastructure macros and data structures for the rb-tree
    based extent cache (sketched below):

    Macros:
    * EXT_TREE_VEC_SIZE: vector size for gang lookup in the extent tree.
    * F2FS_MIN_EXTENT_LEN: minimum length of an extent managed in the cache.
    * EXTENT_CACHE_SHRINK_NUMBER: number of extents to shrink from the cache.

    Basic data structures for the extent cache:
    * struct extent_tree: extent tree entry, one per inode.
    * struct extent_node: extent info node linked in an extent tree.

    Besides, add new extent cache related fields to f2fs_sb_info.

    Signed-off-by: Chao Yu
    Signed-off-by: Changman Lee
    Signed-off-by: Jaegeuk Kim

    Chao Yu
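
    The new structures read roughly as below; this is abridged from the
    description above, and the field details are a plausible sketch rather
    than the exact patch:

        /* one extent tree per inode, holding cached extent info */
        struct extent_tree {
            nid_t ino;                     /* inode number */
            struct rb_root root;           /* root of the extent rb-tree */
            struct extent_node *cached_en; /* recently accessed node */
            rwlock_t lock;                 /* protects the rb-tree */
            atomic_t refcount;             /* references on this tree */
            unsigned int count;            /* # of nodes in the tree */
        };

        /* one node per cached extent, linked both in the per-inode
         * rb-tree and in a global list used for shrinking */
        struct extent_node {
            struct rb_node rb_node;  /* node in the extent rb-tree */
            struct list_head list;   /* node in sbi's global list */
            struct extent_info ei;   /* the cached extent itself */
        };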
     

10 Jan, 2015

4 commits

  • In the normal case, the radix_tree_nodes are freed successfully.
    But when cp_error is detected, we should destroy them forcefully.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • In do_recover_data, we find and update previous node pages after
    updating their new block addresses.
    After that, we call fill_node_footer without the reset flag, erasing
    the cold bit, so this new cold node block is written to the wrong log
    area. This patch fixes the code not to lose the old flag (see the
    sketch below).

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
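
    The shape of the fix, sketched: when reset is not requested,
    fill_node_footer carries the old flag bits (including the cold bit)
    over into the rebuilt footer. Simplified from the real helper:

        /* sketch of fill_node_footer(): without reset, preserve the
         * old flag word so the cold bit survives and the block still
         * lands in the cold node log */
        static inline void fill_node_footer(struct page *page, nid_t nid,
                        nid_t ino, unsigned int ofs, bool reset)
        {
            struct f2fs_node *rn = F2FS_NODE(page);
            unsigned int old_flag = 0;

            if (reset)
                memset(rn, 0, sizeof(*rn));
            else
                old_flag = le32_to_cpu(rn->footer.flag);

            rn->footer.nid = cpu_to_le32(nid);
            rn->footer.ino = cpu_to_le32(ino);

            /* keep the mark bits below the offset field intact */
            rn->footer.flag = cpu_to_le32((ofs << OFFSET_BIT_SHIFT) |
                                          (old_flag & OFFSET_BIT_MASK));
        }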
     
  • This patch moves one member of struct nat_entry, _flag_, into struct
    node_info, so _version_ in struct node_info and _flag_, both of
    unsigned char type, merge into one 32-bit slot in memory. The size of
    nat_entry is thus reduced from 28 bytes to 24 bytes (on a 64-bit
    machine, from 40 bytes to 32 bytes), which reduces the slab memory
    used by f2fs (see the illustration below).

    changes from v2:
    o update the description of the memory usage gain for 64-bit machines,
    as suggested by Changman Lee.
    changes from v1:
    o introduce an inline copy_node_info() to copy valid data from node
    info, as suggested by Jaegeuk Kim; it avoids a bug.

    Reviewed-by: Changman Lee
    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
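
    A standalone illustration of the padding effect being exploited; the
    layouts below are simplified stand-ins for the real structs, compiled
    for a common 64-bit ABI:

        #include <stdio.h>
        #include <stdint.h>

        /* before: flag sits alone in nat_entry, costing pad bytes
         * both after itself and before the 8-byte-aligned pointers */
        struct node_info_old {
            uint32_t nid, ino, blk_addr;
            unsigned char version;
        };
        struct nat_entry_old {
            unsigned char flag;
            struct node_info_old ni;
            void *list_prev, *list_next; /* stand-in for list_head */
        };

        /* after: flag packs into node_info next to version, filling
         * padding that was already there */
        struct node_info_new {
            uint32_t nid, ino, blk_addr;
            unsigned char version;
            unsigned char flag;
        };
        struct nat_entry_new {
            struct node_info_new ni;
            void *list_prev, *list_next;
        };

        int main(void)
        {
            printf("old nat_entry: %zu bytes\n",
                   sizeof(struct nat_entry_old)); /* 40 on x86-64 */
            printf("new nat_entry: %zu bytes\n",
                   sizeof(struct nat_entry_new)); /* 32 on x86-64 */
            return 0;
        }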
     
  • This patch adds two new ioctls to release in-memory pages grabbed by
    atomic writes (usage sketched below).
    o f2fs_ioc_abort_volatile_write
    - If the transaction failed, all the grabbed pages and data should be
    written.
    o f2fs_ioc_release_volatile_write
    - This is to enhance the performance of PERSIST mode in sqlite.

    In order to avoid huge memory consumption causing OOM, this patch
    changes volatile writes to use normal dirty pages, whose flushing to
    disk is blocked as long as the system does not suffer from memory
    pressure.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
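
    From user space, the new ioctls would be driven roughly as below; the
    request macros follow f2fs's ioctl numbering as I understand it, but
    real code should take them from the kernel headers:

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>

        /* 0xf5 is F2FS_IOCTL_MAGIC; copied here for illustration */
        #define F2FS_IOC_RELEASE_VOLATILE_WRITE _IO(0xf5, 4)
        #define F2FS_IOC_ABORT_VOLATILE_WRITE   _IO(0xf5, 5)

        int main(int argc, char **argv)
        {
            if (argc < 2)
                return 1;

            int fd = open(argv[1], O_RDWR);
            if (fd < 0)
                return 1;

            /* e.g. sqlite PERSIST mode: drop the in-memory volatile
             * pages once the transaction no longer needs them */
            if (ioctl(fd, F2FS_IOC_RELEASE_VOLATILE_WRITE) < 0)
                perror("release volatile write");

            close(fd);
            return 0;
        }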
     

07 Nov, 2014

1 commit


04 Nov, 2014

1 commit


06 Oct, 2014

1 commit


01 Oct, 2014

1 commit

  • Previously, f2fs tried to reorganize the dirty nat entries into
    multiple sets according to their nid ranges. This can improve the
    flushing of nat pages; however, if there are a lot of cached nat
    entries, it becomes a bottleneck.

    This patch introduces a new set management flow, removing the dirty
    nat list and adding a series of set operations when a nat entry
    becomes dirty.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

24 Sep, 2014

2 commits

  • This patch revisits all of the recovery information used during
    f2fs_sync_file.

    In this patch, three pieces of information are used to make a decision:

    a) IS_CHECKPOINTED, /* is it checkpointed before? */
    b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */
    c) HAS_LAST_FSYNC, /* has the latest node fsync mark? */

    And, the scenarios for our rule are based on:

    [Term] F: fsync_mark, D: dentry_mark

    1. inode(x) | CP | inode(x) | dnode(F)
    2. inode(x) | CP | inode(F) | dnode(F)
    3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
    4. inode(x) | CP | dnode(F) | inode(F)
    5. CP | inode(x) | dnode(F) | inode(DF)
    6. CP | inode(DF) | dnode(F)
    7. CP | dnode(F) | inode(DF)
    8. CP | dnode(F) | inode(x) | inode(DF)

    For example, in #3, the three conditions change as follows.

       inode(x) | CP | dnode(F) | inode(x) | inode(F)
    a)    x       o      o          o          o
    b)    x       x      x          x          o
    c)    x       o      o          x          o

    If f2fs_sync_file stops ---------^,
    it should write inode(F) -------------------^

    So, need_inode_block_update should return true, since
    c) get_nat_flag(e, HAS_LAST_FSYNC) is false.

    For example, in #8:

       CP | alloc | dnode(F) | inode(x) | inode(DF)
    a) o      x        x          x           x
    b) x               x          x           o
    c) o               o          x           o

    If f2fs_sync_file stops ------^,
    it should write inode(DF) ----------------^

    Note that the roll-forward policy should follow this rule, which means
    that if any blocks are missing, we don't need to recover that inode.

    Signed-off-by: Huang Ying
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch introduces a flag in the nat entry structure to merge
    various pieces of information, such as the checkpointed and fsync_done
    marks (see the sketch below).

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
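
    A standalone sketch of merging several marks into one flag byte; the
    flag names mirror the kernel's, the rest is illustrative:

        #include <stdio.h>

        /* bit positions merged into a single flag field, as with the
         * kernel's IS_CHECKPOINTED / HAS_FSYNCED_INODE / HAS_LAST_FSYNC */
        enum {
            IS_CHECKPOINTED,   /* is it checkpointed before? */
            HAS_FSYNCED_INODE, /* is the inode fsynced before? */
            HAS_LAST_FSYNC,    /* has the latest node fsync mark? */
        };

        struct toy_nat_entry {
            unsigned char flag;
        };

        static void set_nat_flag(struct toy_nat_entry *ne,
                                 unsigned int type, int set)
        {
            unsigned char mask = 0x01 << type;
            if (set)
                ne->flag |= mask;
            else
                ne->flag &= ~mask;
        }

        static int get_nat_flag(struct toy_nat_entry *ne, unsigned int type)
        {
            return (ne->flag & (0x01 << type)) != 0;
        }

        int main(void)
        {
            struct toy_nat_entry ne = { 0 };
            set_nat_flag(&ne, HAS_LAST_FSYNC, 1);
            printf("last fsync: %d\n", get_nat_flag(&ne, HAS_LAST_FSYNC));
            set_nat_flag(&ne, HAS_LAST_FSYNC, 0);
            printf("last fsync: %d\n", get_nat_flag(&ne, HAS_LAST_FSYNC));
            return 0;
        }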
     

16 Sep, 2014

1 commit

  • The nm_i->fcnt check is executed before taking the spin_lock, so if
    another thread deletes the last free_nid from the list, a wrong nid
    may be returned. Fix the race condition by moving the nm_i->fcnt check
    inside the spin_lock (sketched below).

    Signed-off-by: Huang, Ying
    Signed-off-by: Jaegeuk Kim

    Huang Ying
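
    The shape of the fix in kernel style, heavily simplified from
    alloc_nid:

        /* fixed pattern: take the lock first, then test fcnt, so a
         * concurrent thread cannot remove the last free nid between
         * the check and the list access */
        spin_lock(&nm_i->free_nid_list_lock);
        if (nm_i->fcnt) {
            struct free_nid *i = list_first_entry(&nm_i->free_nid_list,
                                                  struct free_nid, list);
            *nid = i->nid;
            /* ... mark it allocated and decrement nm_i->fcnt ... */
        }
        spin_unlock(&nm_i->free_nid_list_lock);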
     

04 Sep, 2014

1 commit


10 Jul, 2014

1 commit

  • Although building the NAT journal in cursum reduces the read/write
    work for NAT blocks, the previous design leaves us with lower
    performance when checkpoints are written frequently, in these cases:
    1. If the journal in cursum is already full, it's a bit of a waste
    that we flush all nat entries to pages for persistence but do not
    cache any entries.
    2. If the journal in cursum is not full, we fill nat entries into the
    journal until it is full, then flush the remaining dirty entries to
    disk without merging the journaled entries, so those journaled entries
    may be flushed to disk at the next checkpoint, having lost the chance
    to be flushed last time.

    In this patch we merge dirty entries located in the same NAT block
    into a nat entry set, and link all the sets into a list sorted in
    ascending order by each set's entry count (sketched below). Later we
    flush the entries of the sparsest sets into the journal as far as it
    has room, and then flush the merged entries to disk. In this way we
    not only gain performance, but also extend the lifetime of the flash
    device.

    In my testing environment, this patch visibly reduces NAT block
    writes. In the hard disk test case, the elapsed time of fsstress is
    stably reduced by about 5%.

    1. virtual machine + hard disk:
    fsstress -p 20 -n 200 -l 5

                node num    cp count    nodes/cp
    based       4599.6      1803.0      2.551
    patched     2714.6      1829.6      1.483

    2. virtual machine + 32g micro SD card:
    fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0
    -f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5
    -f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S

                node num    cp count    nodes/cp
    based       84.5        43.7        1.933
    patched     49.2        40.0        1.23

    The latency of the merge operation looks acceptable even when handling
    an extreme case like merging a great number of dirty nats:

    latency(ns)    dirty nat count
    3089219        24922
    5129423        27422
    4000250        24523

    change log from v1:
    o fix wrong logic in add_nat_entry when grabbing a new nat entry set.
    o switch to creating the slab cache in create_node_manager_caches.
    o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency.

    change log from v2:
    o make comment position more appropriate suggested by Jaegeuk Kim.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
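
    A standalone sketch of the set bookkeeping: dirty entries are grouped
    per NAT block, and the sets are kept sorted ascending by entry count
    so a flush can journal the sparsest sets first. All names are
    illustrative:

        #include <stdio.h>
        #include <stdlib.h>

        #define NAT_ENTRY_PER_BLOCK 455  /* entries per NAT block */

        /* one set per NAT block holding that block's dirty entries */
        struct nat_set {
            unsigned int set_idx;   /* nid / NAT_ENTRY_PER_BLOCK */
            unsigned int entry_cnt; /* dirty entries in this set */
            struct nat_set *next;
        };

        /* insert keeping the list sorted ascending by entry_cnt, so a
         * flush fills the journal from the sparsest sets first and
         * writes only the dense sets back to NAT blocks */
        static void add_sorted(struct nat_set **head, struct nat_set *s)
        {
            while (*head && (*head)->entry_cnt <= s->entry_cnt)
                head = &(*head)->next;
            s->next = *head;
            *head = s;
        }

        int main(void)
        {
            unsigned int counts[] = { 120, 3, 455, 17 };
            struct nat_set *head = NULL;

            for (unsigned int i = 0; i < 4; i++) {
                struct nat_set *s = malloc(sizeof(*s));
                s->set_idx = i;
                s->entry_cnt = counts[i];
                add_sorted(&head, s);
            }
            for (struct nat_set *s = head; s; s = s->next)
                printf("set %u: %u dirty entries\n",
                       s->set_idx, s->entry_cnt);
            while (head) {
                struct nat_set *n = head->next;
                free(head);
                head = n;
            }
            return 0;
        }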
     

07 May, 2014

4 commits


20 Mar, 2014

3 commits

  • If multiple redundant fsync calls are triggered, we don't need to keep
    writing the inode's node pages with the fsync mark.

    So, this patch adds FI_NEED_FSYNC to track whether the latest node
    block was written with the fsync mark or not.
    If the mark was set, a new fsync doesn't need to write a node block.
    Otherwise, we should write a new node block with the mark for
    roll-forward recovery (sketched below).

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
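
    A standalone sketch of the bookkeeping: remember whether the latest
    node block already went out with the fsync mark, and skip the
    redundant write if so. The names are illustrative, not the kernel's:

        #include <stdio.h>
        #include <stdbool.h>

        /* per-inode state: does the latest written node block
         * already carry the fsync mark? */
        struct toy_inode {
            bool need_fsync_mark; /* cleared once a marked block is out */
        };

        /* any new update to the inode's nodes re-arms the mark */
        static void mark_dirty(struct toy_inode *fi)
        {
            fi->need_fsync_mark = true;
        }

        static void do_fsync(struct toy_inode *fi)
        {
            if (!fi->need_fsync_mark) {
                /* latest node block already has the mark: nothing
                 * new to write for roll-forward recovery */
                printf("fsync: skipped node write\n");
                return;
            }
            printf("fsync: writing node block with fsync mark\n");
            fi->need_fsync_mark = false;
        }

        int main(void)
        {
            struct toy_inode fi = { 0 };
            mark_dirty(&fi);
            do_fsync(&fi); /* writes the marked node block */
            do_fsync(&fi); /* redundant fsync: skipped */
            return 0;
        }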
     
  • The NM_WOUT_THRESHOLD is now obsolete, since f2fs started throttling
    on the basis of its memory footprint.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch introduces ram_thresh, a sysfs entry which controls the
    memory footprint used by the free nid list and the nat cache.

    Previously, the free nid list was controlled by MAX_FREE_NIDS, while
    the nat cache was managed by NM_WOUT_THRESHOLD.
    However, this approach cannot adapt dynamically to the system.

    So, this patch adds ram_thresh, with which users can specify the
    threshold in units of 1/1024 of total RAM.
    For example, if the total ram size is 4GB and the value is set to the
    default of 10, f2fs tries to keep the free nids and nat caches from
    consuming more than 10 * (4GB / 1024) = 40MB (see the check below).

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
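
    The arithmetic as a standalone check (names illustrative):

        #include <stdio.h>

        /* ram_thresh is expressed in 1/1024ths of total RAM */
        static unsigned long long nm_budget(unsigned long long total_ram,
                                            unsigned int ram_thresh)
        {
            return total_ram / 1024 * ram_thresh;
        }

        int main(void)
        {
            unsigned long long four_gb = 4ULL << 30;

            /* default ram_thresh = 10 on a 4GB machine:
             * 10 * (4GB / 1024) = 40MB for free nids + nat cache */
            printf("budget: %lluMB\n", nm_budget(four_gb, 10) >> 20);
            return 0;
        }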
     

18 Mar, 2014

1 commit


24 Feb, 2014

1 commit


23 Dec, 2013

1 commit

  • Update several comments:
    1. use f2fs_{un}lock_op instead of mutex_{un}lock_op.
    2. update the comment of get_data_block().
    3. update the description of node offset.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     

09 Aug, 2013

1 commit

  • This patch fixes the use of XATTR_NODE_OFFSET.

    o The offset should not use the several MSB bits that are used for
    marking node blocks.

    o IS_DNODE should handle XATTR_NODE_OFFSET to avoid a potential
    abnormality during the fsync call (sketched below).

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
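
    A sketch of both points in kernel style; OFFSET_BIT_SHIFT and
    ofs_of_node() are existing f2fs names, the rest is simplified:

        /* node offsets are stored as (ofs << OFFSET_BIT_SHIFT) inside
         * the footer's 32-bit flag word alongside the mark bits, so an
         * offset constant must leave its top OFFSET_BIT_SHIFT bits
         * clear or they would be shifted out */
        #define XATTR_NODE_OFFSET ((((unsigned int)-1) << OFFSET_BIT_SHIFT) \
                                    >> OFFSET_BIT_SHIFT)

        /* and IS_DNODE must not treat the xattr node as a data node,
         * or fsync may send it down the wrong path */
        static inline bool IS_DNODE(struct page *node_page)
        {
            unsigned int ofs = ofs_of_node(node_page);

            if (ofs == XATTR_NODE_OFFSET)
                return false;
            /* ... existing direct-node offset checks ... */
            return true;
        }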