08 Oct, 2014

1 commit

  • This patch adds support for volatile writes which keep data pages in memory
    until f2fs_evict_inode is called by iput.

    For instance, we can use this feature for the sqlite database as follows.
    While supporting atomic writes for the main database file, we can keep its
    journal data temporarily in the page cache by the following sequence (a
    sketch in C follows the sequence).

    1. open
    -> ioctl(F2FS_IOC_START_VOLATILE_WRITE);
    2. writes
    : keep all the data in the page cache.
    3. flush to the database file with atomic writes
    a. ioctl(F2FS_IOC_START_ATOMIC_WRITE);
    b. writes
    c. ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
    4. close
    -> drop the cached data
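
    A minimal userspace sketch of this sequence, assuming the ioctl numbers
    from fs/f2fs/f2fs.h of this era (magic 0xf5, commands 1-3); verify them
    against your kernel headers. The file names "journal" and "database" are
    placeholders:

        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>

        #define F2FS_IOCTL_MAGIC              0xf5
        #define F2FS_IOC_START_ATOMIC_WRITE   _IO(F2FS_IOCTL_MAGIC, 1)
        #define F2FS_IOC_COMMIT_ATOMIC_WRITE  _IO(F2FS_IOCTL_MAGIC, 2)
        #define F2FS_IOC_START_VOLATILE_WRITE _IO(F2FS_IOCTL_MAGIC, 3)

        int main(void)
        {
                char buf[4096] = { 0 };

                /* 1. open the journal and mark it volatile: its pages stay
                 *    in the page cache until f2fs_evict_inode on last iput */
                int jfd = open("journal", O_RDWR | O_CREAT, 0644);
                if (jfd < 0 || ioctl(jfd, F2FS_IOC_START_VOLATILE_WRITE) < 0)
                        return 1;

                /* 2. journal writes stay in memory, not forced to disk */
                write(jfd, buf, sizeof(buf));

                /* 3. flush to the database file with atomic writes */
                int dfd = open("database", O_RDWR | O_CREAT, 0644);
                if (dfd < 0 || ioctl(dfd, F2FS_IOC_START_ATOMIC_WRITE) < 0)
                        return 1;
                write(dfd, buf, sizeof(buf));
                ioctl(dfd, F2FS_IOC_COMMIT_ATOMIC_WRITE);

                /* 4. close: the cached journal data is dropped */
                close(dfd);
                close(jfd);
                return 0;
        }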

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

07 Oct, 2014

1 commit

  • This patch introduces very limited functionality for atomic write support.
    In order to support atomic write, this patch adds two ioctls:
    o F2FS_IOC_START_ATOMIC_WRITE
    o F2FS_IOC_COMMIT_ATOMIC_WRITE

    The database engine should be aware of the following sequence.
    1. open
    -> ioctl(F2FS_IOC_START_ATOMIC_WRITE);
    2. writes
    : all the written data will be treated as atomic pages.
    3. commit
    -> ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
    : this flushes all the data blocks to the disk, which will be shown as all
    or nothing by the f2fs recovery procedure.
    4. repeat from #2.

    The IO patterns should be:

         ,- START_ATOMIC_WRITE           ,- COMMIT_ATOMIC_WRITE
    CP | D D D D D D | FSYNC | D D D D | FSYNC ...
                       `- COMMIT_ATOMIC_WRITE
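
    A minimal sketch of this loop in C, assuming the same ioctl definitions as
    in the previous sketch (magic 0xf5, commands 1 and 2; verify against your
    kernel headers):

        #include <unistd.h>
        #include <sys/ioctl.h>

        #define F2FS_IOCTL_MAGIC             0xf5
        #define F2FS_IOC_START_ATOMIC_WRITE  _IO(F2FS_IOCTL_MAGIC, 1)
        #define F2FS_IOC_COMMIT_ATOMIC_WRITE _IO(F2FS_IOCTL_MAGIC, 2)

        /* perform `rounds` all-or-nothing updates on an open f2fs fd */
        int atomic_updates(int fd, const char *buf, size_t len, int rounds)
        {
                /* 1. mark the file as an atomic-write target */
                if (ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE) < 0)
                        return -1;

                while (rounds--) {
                        /* 2. written data is held as atomic pages */
                        if (write(fd, buf, len) != (ssize_t)len)
                                return -1;
                        /* 3. commit (the FSYNC in the diagram): flush the
                         *    D blocks so recovery sees them all or nothing */
                        if (ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE) < 0)
                                return -1;
                        /* 4. the loop repeats from step 2 */
                }
                return 0;
        }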

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

01 Oct, 2014

4 commits


24 Sep, 2014

3 commits

  • If the same data is updated multiple times, we don't need to redo the whole
    set of operations.
    Let's just apply the latest one.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch revisits all the recovery information used during f2fs_sync_file.

    In this patch, three pieces of information drive the decision.

    a) IS_CHECKPOINTED, /* is it checkpointed before? */
    b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */
    c) HAS_LAST_FSYNC, /* has the latest node fsync mark? */

    And, the scenarios for our rule are based on:

    [Term] F: fsync_mark, D: dentry_mark

    1. inode(x) | CP | inode(x) | dnode(F)
    2. inode(x) | CP | inode(F) | dnode(F)
    3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
    4. inode(x) | CP | dnode(F) | inode(F)
    5. CP | inode(x) | dnode(F) | inode(DF)
    6. CP | inode(DF) | dnode(F)
    7. CP | dnode(F) | inode(DF)
    8. CP | dnode(F) | inode(x) | inode(DF)

    For example, #3, the three conditions should be changed as follows.

       inode(x) | CP | dnode(F) | inode(x) | inode(F)
    a)    x       o       o          o          o
    b)    x       x       x          x          o
    c)    x       o       o          x          o

    If f2fs_sync_file stops  --------^,
    it should write inode(F) -------------------^

    So, need_inode_block_update should return true, since
    c) get_nat_flag(e, HAS_LAST_FSYNC) is false.

    For example, #8,
       CP | alloc | dnode(F) | inode(x) | inode(DF)
    a) o      x        x          x           x
    b) x               x          x           o
    c) o               o          x           o

    If f2fs_sync_file stops  -----^,
    it should write inode(DF) ----------------^

    Note that the roll-forward policy should follow this rule, which means that
    if there are any missing blocks, we don't need to recover that inode.
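
    A hedged sketch of how the three flags combine in this decision follows
    (simplified kernel-style C; the nat cache lookup helper and locking are
    assumed, and this is not the exact kernel function):

        /* fsync must append an inode block unless the node carries the
         * latest fsync mark AND the inode was either checkpointed or
         * fsynced before */
        bool need_inode_block_update(struct f2fs_sb_info *sbi, nid_t ino)
        {
                struct nat_entry *e = __lookup_nat_cache(NM_I(sbi), ino);

                if (e && get_nat_flag(e, HAS_LAST_FSYNC) &&
                    (get_nat_flag(e, IS_CHECKPOINTED) ||
                     get_nat_flag(e, HAS_FSYNCED_INODE)))
                        return false;
                return true;
        }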

    Signed-off-by: Huang Ying
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • Previously, all the dnode pages had to be read during roll-forward recovery.
    Even worse, the whole chain was traversed twice.
    This patch removes those redundant and costly read operations by using the
    page cache of meta_inode and the readahead function as well.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

16 Sep, 2014

2 commits

  • If the user writes F2FS_IPU_FSYNC:4 to /sys/fs/f2fs/ipu_policy, in-place
    updates are attempted only by f2fs_sync_file.
    Then, if the number of dirty pages exceeds /sys/fs/f2fs/min_fsync_blocks,
    fsync keeps the normal out-of-order (out-of-place update) manner;
    otherwise, it triggers in-place updates.

    This may be used by storage showing very high random write performance.

    For example, it can be used when,

    Seq. writes (Data) + wait + Seq. writes (Node)

    is much slower than,

    Rand. writes (Data)
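
    A sketch of selecting this policy from userspace (the value 4 corresponds
    to F2FS_IPU_FSYNC; on some kernel versions the knob lives in a per-device
    directory, e.g. /sys/fs/f2fs/<dev>/ipu_policy, so treat the path as an
    assumption):

        #include <stdio.h>

        int main(void)
        {
                /* 4 == F2FS_IPU_FSYNC; path may be per-device */
                FILE *f = fopen("/sys/fs/f2fs/ipu_policy", "w");

                if (!f)
                        return 1;
                fprintf(f, "4\n");
                fclose(f);
                return 0;
        }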

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • Previously, f2fs only counted dirty dentry pages, but there is no reason not
    to expand the scope.

    This patch changes the names used in dirty page management and also counts
    dirty pages in each inode's info.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

10 Sep, 2014

4 commits

  • We use flush cmd control to collect many flush commands and flush them
    together. Previously, two lists managed the flush commands (collect and
    dispatch), and one spinlock was used to protect them.
    In fact, the lock-less list (llist) is very well suited to this case, so we
    use it to simplify this routine (a sketch of the pattern follows the
    change log below).

    v2:
    - use llist_for_each_entry_safe to fix a possible use-after-free issue.
    - remove the unused field from struct flush_cmd.
    Thanks to Yu for the suggestion.
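
    A kernel-style sketch of the llist pattern described above, with struct
    flush_cmd reduced to what the pattern needs (an assumption, not the full
    f2fs structure):

        #include <linux/llist.h>
        #include <linux/completion.h>

        struct flush_cmd {
                struct completion wait;
                struct llist_node llnode;
                int ret;
        };

        static LLIST_HEAD(issue_list);

        /* producer: queue a command without taking any lock */
        static void issue_flush(struct flush_cmd *cmd)
        {
                init_completion(&cmd->wait);
                llist_add(&cmd->llnode, &issue_list);
        }

        /* consumer: detach the whole batch at once, then complete each
         * command; the _safe variant is needed because the waiter may
         * free cmd as soon as it is completed */
        static void dispatch_flushes(int err)
        {
                struct llist_node *batch = llist_del_all(&issue_list);
                struct flush_cmd *cmd, *next;

                llist_for_each_entry_safe(cmd, next, batch, llnode) {
                        cmd->ret = err;
                        complete(&cmd->wait);
                }
        }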

    Signed-off-by: Gu Zheng
    Signed-off-by: Jaegeuk Kim

    Gu Zheng
     
  • In commit aec71382c681 ("f2fs: refactor flush_nat_entries codes for reducing NAT
    writes"), we described the issue as follows:

    "Although building the NAT journal in cursum reduces the read/write work for
    the NAT block, the previous design left us with lower performance when
    writing checkpoints frequently, in these cases:
    1. if the journal in cursum is already full, it's a bit of a waste that we
    flush all nat entries to the page for persistence, but do not cache any
    entries.
    2. if the journal in cursum is not full, we fill nat entries into the journal
    until the journal is full, then flush the remaining dirty entries to disk
    without merging the journaled entries, so these journaled entries may be
    flushed to disk at the next checkpoint, having lost the chance to be flushed
    last time."

    Actually, we have the same problem in using SIT journal area.

    In this patch, we first update the SIT journal with as many dirty entries as
    possible. Second, if there is no space left in the SIT journal, we remove
    all entries from the journal and walk through the whole dirty entry bitmap
    of the SIT, accounting the dirty SIT entries located in the same SIT block
    to a SIT entry set. All entry sets are linked to the list sit_entry_set in
    sm_info, sorted in ascending order by the count of entries in each set.
    Later we flush the entries of the sets that have the fewest entries into the
    journal, as many as we can, and then flush the dense sets with merged
    entries to disk.

    In this way we can use the SIT journal area more effectively, and we also
    reduce SIT updates, resulting in a performance gain and a longer lifetime
    for the flash device. A sketch of the set bookkeeping follows.
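
    A hedged sketch of the set bookkeeping (simplified kernel-style C; the
    struct layout and helper are illustrative, not the exact kernel code):

        #include <linux/list.h>

        struct sit_entry_set {
                struct list_head set_list;  /* must stay the first member */
                unsigned int start_segno;   /* first segno of the SIT block */
                unsigned int entry_cnt;     /* dirty entries in this set */
        };

        /* re-sort one set after its entry_cnt was just incremented: walk
         * forward past sets with fewer entries and reinsert before the
         * first set that has at least as many */
        static void adjust_sit_entry_set(struct sit_entry_set *ses,
                                         struct list_head *head)
        {
                struct sit_entry_set *next = ses;

                if (list_is_last(&ses->set_list, head))
                        return;

                list_for_each_entry_continue(next, head, set_list)
                        if (ses->entry_cnt <= next->entry_cnt)
                                break;

                /* with no break, &next->set_list is `head` itself; that is
                 * safe only because set_list is the first struct member */
                list_move_tail(&ses->set_list, &next->set_list);
        }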

    In my testing environment, this patch helps to reduce SIT block updates
    noticeably.

    virtual machine + hard disk:
    fsstress -p 20 -n 400 -l 5

                 sit page num   cp count   sit pages/cp
    based           2006.50     1349.75      1.486
    patched         1566.25     1463.25      1.070

    The latency of the merging op is small when handling a great number of
    dirty SIT entries in flush_sit_entries:

    latency(ns)   dirty sit count
      36038            2151
      49168            2123
      37174            2232

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • If any f2fs_bug_on is triggered, fsck.f2fs is needed.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch adds sbi->need_fsck, which indicates that fsck.f2fs should be run
    later.
    This flag can only be cleared by fsck.f2fs.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

04 Sep, 2014

1 commit


22 Aug, 2014

5 commits


20 Aug, 2014

4 commits


02 Aug, 2014

1 commit


31 Jul, 2014

5 commits


29 Jul, 2014

4 commits


12 Jul, 2014

1 commit


10 Jul, 2014

4 commits

  • Signed-off-by: Gu Zheng
    Signed-off-by: Jaegeuk Kim

    Gu Zheng
     
  • Although building the NAT journal in cursum reduces the read/write work for
    the NAT block, the previous design left us with lower performance when
    writing checkpoints frequently, in these cases:
    1. if the journal in cursum is already full, it's a bit of a waste that we
    flush all nat entries to the page for persistence, but do not cache any
    entries.
    2. if the journal in cursum is not full, we fill nat entries into the journal
    until the journal is full, then flush the remaining dirty entries to disk
    without merging the journaled entries, so these journaled entries may be
    flushed to disk at the next checkpoint, having lost the chance to be flushed
    last time.

    In this patch we merge the dirty entries located in the same NAT block into
    a nat entry set, and link all sets to a list sorted in ascending order by
    each set's entry count. Later we flush the entries of the sparse sets into
    the journal, as many as we can, and then flush the merged entries to disk.
    In this way we can not only gain performance, but also save the lifetime of
    the flash device. A sketch of the grouping follows.
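
    A hedged sketch of the grouping (simplified kernel-style C; the helper name
    is illustrative, and 455 assumes a 4KB NAT block of 9-byte entries, so
    verify it against NAT_ENTRY_PER_BLOCK in f2fs.h):

        #include <linux/list.h>

        #define NAT_ENTRY_PER_BLOCK 455  /* assumed: 4096 / 9-byte entry */

        struct nat_entry_set {
                struct list_head set_list;   /* link in the sorted set list */
                struct list_head entry_list; /* dirty nat entries of one block */
                unsigned int set;            /* which NAT block the set covers */
                unsigned int entry_cnt;      /* dirty entries in this set */
        };

        /* every dirty nid maps to one set, so flushing a set journals or
         * writes exactly one NAT block */
        static unsigned int nat_block_of(unsigned int nid)
        {
                return nid / NAT_ENTRY_PER_BLOCK;
        }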

    In my testing environment, this patch helps to reduce NAT block writes
    noticeably. In the hard disk test case, the elapsed time of fsstress is
    stably reduced by about 5%.

    1. virtual machine + hard disk:
    fsstress -p 20 -n 200 -l 5

               node num   cp count   nodes/cp
    based       4599.6     1803.0     2.551
    patched     2714.6     1829.6     1.483

    2. virtual machine + 32g micro SD card:
    fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0
    -f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5
    -f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S

               node num   cp count   nodes/cp
    based         84.5       43.7     1.933
    patched       49.2       40.0     1.23

    The latency of the merging op is not bad when handling extreme cases, such
    as merging a great number of dirty nats:

    latency(ns)   dirty nat count
      3089219          24922
      5129423          27422
      4000250          24523

    change log from v1:
    o fix wrong logic in add_nat_entry when grabbing a new nat entry set.
    o switch to creating the slab cache in create_node_manager_caches.
    o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency.

    change log from v2:
    o make the comment position more appropriate, as suggested by Jaegeuk Kim.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
  • This patch cleans up some simple, unnecessary code.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch adds f2fs_do_tmpfile to eliminate the redundant init_inode_metadata
    flow.
    Through this, we can provide consistent lock usage, e.g., fi->i_sem, and this
    will enable better debugging.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim