08 Oct, 2014

1 commit

  • This patch adds support for volatile writes which keep data pages in memory
    until f2fs_evict_inode is called by iput.

    For instance, we can use this feature for the sqlite database as follows.
    While supporting atomic writes for the main database file, we can keep its
    journal data temporarily in the page cache by the following sequence; a
    minimal userspace sketch follows it.

    1. open
    -> ioctl(F2FS_IOC_START_VOLATILE_WRITE);
    2. writes
    : keep all the data in the page cache.
    3. flush to the database file with atomic writes
    a. ioctl(F2FS_IOC_START_ATOMIC_WRITE);
    b. writes
    c. ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
    4. close
    -> drop the cached data
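
    The sketch below covers the journal-file side of this sequence (steps 1, 2,
    and 4). The ioctl request numbers are assumptions based on f2fs's 0xf5 ioctl
    magic, not copied from kernel headers; prefer your kernel's f2fs header if
    it exports them. Error handling is reduced.

    #include <fcntl.h>
    #include <linux/ioctl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define F2FS_IOCTL_MAGIC              0xf5  /* assumed */
    #define F2FS_IOC_START_VOLATILE_WRITE _IO(F2FS_IOCTL_MAGIC, 3)

    int write_journal(const char *path, const void *buf, size_t len)
    {
        /* 1. open the journal file */
        int fd = open(path, O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
            return -1;

        /* mark it volatile: data pages stay in the page cache */
        if (ioctl(fd, F2FS_IOC_START_VOLATILE_WRITE) < 0) {
            close(fd);
            return -1;
        }

        /* 2. writes are kept in memory only */
        if (write(fd, buf, len) < 0) {
            close(fd);
            return -1;
        }

        /* 4. close -> iput -> f2fs_evict_inode drops the cached data */
        close(fd);
        return 0;
    }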

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

07 Oct, 2014

1 commit

  • This patch introduces a very limited functionality for atomic write support.
    In order to support atomic write, this patch adds two ioctls:
    o F2FS_IOC_START_ATOMIC_WRITE
    o F2FS_IOC_COMMIT_ATOMIC_WRITE

    The database engine should be aware of the following sequence.
    1. open
    -> ioctl(F2FS_IOC_START_ATOMIC_WRITE);
    2. writes
    : all the written data will be treated as atomic pages.
    3. commit
    -> ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
    : this flushes all the data blocks to disk, which will be shown as
    all-or-nothing by the f2fs recovery procedure.
    4. repeat from #2, as sketched after the IO pattern below.

    The IO patterns should be:

         ,- START_ATOMIC_WRITE         ,- COMMIT_ATOMIC_WRITE
    CP | D D D D D D | FSYNC | D D D D | FSYNC ...
                     `- COMMIT_ATOMIC_WRITE
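
    A hedged sketch of the commit loop this pattern implies; the ioctl numbers
    are assumptions mirroring f2fs's ioctl numbering, and error handling is
    omitted for brevity.

    #include <fcntl.h>
    #include <linux/ioctl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define F2FS_IOCTL_MAGIC             0xf5  /* assumed */
    #define F2FS_IOC_START_ATOMIC_WRITE  _IO(F2FS_IOCTL_MAGIC, 1)
    #define F2FS_IOC_COMMIT_ATOMIC_WRITE _IO(F2FS_IOCTL_MAGIC, 2)

    /* One transaction: everything written between START and COMMIT becomes
     * visible all-or-nothing across a crash. */
    void atomic_update(int fd, const void *buf, size_t len, off_t off)
    {
        ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE);   /* 1. begin */
        pwrite(fd, buf, len, off);                /* 2. atomic pages */
        ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE);  /* 3. flush + fsync */
        /* 4. call again for the next transaction */
    }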

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

06 Oct, 2014

1 commit


01 Oct, 2014

7 commits

  • This patch cleans up f2fs_ioctl functions for better readability.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • My static checker complains that segment is a u64 but only the lower 31
    bits can be used before we hit a shift wrapping bug.
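
    For illustration, this is the class of bug the checker points at (a generic
    sketch, not the patched f2fs code): if any intermediate of the shift is only
    32 bits wide, segment numbers above the low bits wrap.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t segment = UINT64_C(1) << 31;  /* past the safe lower bits */
        unsigned int log_blocks_per_seg = 9;   /* 512 blocks per segment */

        /* wraps: the shift is done in 32-bit arithmetic */
        uint32_t bad = (uint32_t)segment << log_blocks_per_seg;
        /* fine: the whole computation stays in 64 bits */
        uint64_t good = segment << log_blocks_per_seg;

        printf("bad=%u good=%llu\n", bad, (unsigned long long)good);
        return 0;
    }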

    Signed-off-by: Dan Carpenter
    Signed-off-by: Jaegeuk Kim

    Dan Carpenter
     
    This patch relocates f2fs_unlock_op in every directory operation so that it
    is called after any error has been processed.
    Otherwise, the checkpoint can be entered with valid node ids without their
    dentries when -ENOSPC occurs.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch cleans up the existing and new macros for readability.

    The rule is like this.

                 ,-----------------------------------------> MAX_BLKADDR -,
                 |  ,------------- TOTAL_BLKS ----------------------------,
                 |  |                                                     |
                 |  ,- seg0_blkaddr   ,----- sit/nat/ssa/main blkaddress  |
    block        |  | (SEG0_BLKADDR)  | | | | (e.g., MAIN_BLKADDR)        |
    address      0..x................ a b c d .............................
                    |                                                     |
    global seg#     0...................... m .............................
                    |                       |                             |
                    |                       `------- MAIN_SEGS -----------'
                    `-------------- TOTAL_SEGS ---------------------------'
                                            |                             |
    seg#                                    0..........xx..................

    = Note =
    o GET_SEGNO_FROM_SEG0 : blk address -> global segno
    o GET_SEGNO : blk address -> segno
    o START_BLOCK : segno -> starting block address
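
    Hedged sketches of what the noted macros compute, in terms of the diagram
    above (illustrative only, not the exact kernel definitions):

    /* blk address -> global segno, counted from seg0_blkaddr */
    #define GET_SEGNO_FROM_SEG0(sbi, blk_addr) \
        (((blk_addr) - SEG0_BLKADDR(sbi)) >> (sbi)->log_blocks_per_seg)

    /* blk address -> segno inside the main area */
    #define GET_SEGNO(sbi, blk_addr) \
        (GET_SEGNO_FROM_SEG0(sbi, blk_addr) - \
         GET_SEGNO_FROM_SEG0(sbi, MAIN_BLKADDR(sbi)))

    /* segno -> starting block address of that segment
     * (block_t is f2fs's block address type) */
    #define START_BLOCK(sbi, segno) \
        (MAIN_BLKADDR(sbi) + ((block_t)(segno) << (sbi)->log_blocks_per_seg))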

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    Previously, f2fs tried to reorganize the dirty nat entries into multiple sets
    according to their nid ranges. This can improve the flushing of nat pages;
    however, if there are a lot of cached nat entries, it becomes a bottleneck.

    This patch introduces a new set management flow by removing dirty nat list and
    adding a series of set operations when the nat entry becomes dirty.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    This patch introduces FITRIM in f2fs_ioctl.
    In this case, f2fs will issue as many small discards and prefree discards as
    possible for the given area.
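
    FITRIM is the generic VFS trim ioctl; a minimal userspace sketch (the mount
    point path is hypothetical):

    #include <fcntl.h>
    #include <linux/fs.h>      /* FITRIM, struct fstrim_range */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        struct fstrim_range range = {
            .start  = 0,
            .len    = (__u64)-1,  /* trim the whole filesystem */
            .minlen = 0,
        };
        int fd = open("/mnt/f2fs", O_RDONLY);  /* hypothetical mount point */

        if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
            perror("FITRIM");
            return 1;
        }
        /* on return, range.len holds the number of bytes trimmed */
        printf("trimmed %llu bytes\n", (unsigned long long)range.len);
        close(fd);
        return 0;
    }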

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    This patch adds a new data structure to control checkpoint parameters.
    Currently, it carries the reason for the checkpoint, such as is_umount and
    normal sync.
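
    A hedged sketch of such a structure (the names are assumptions, not
    necessarily the exact kernel definitions):

    enum cp_reason_type {
        CP_UMOUNT,   /* checkpoint taken at unmount */
        CP_SYNC,     /* normal sync checkpoint */
    };

    struct cp_control {
        int reason;
        /* room for future checkpoint parameters */
    };

    /* callers pass it down, e.g.: write_checkpoint(sbi, &cpc); */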

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

24 Sep, 2014

15 commits

    Previously, f2fs activates SSR when the # of free segments reaches the # of
    overprovisioned segments.
    In this case, SSR starts to use dirty segments only, so the overprovisioned
    space cannot be selected for new data.
    This means that we have no chance to utilize the overprovisioned space at all.

    This patch fixes that by allowing LFS allocations until the # of free segments
    reaches the last threshold, the reserved space.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch changes the ipu_policy setting to use any combination of orthogonal policies.

    Signed-off-by: Changman Lee
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    In ->get_victim we read max_search from dirty_i->nr_dirty without holding
    seglist_lock; after that, nr_dirty can be increased or decreased before we
    take the lock.
    Then, in the main loop, we attempt to traverse all dirty sections once to find
    a victim section, but max_search is not an accurate total loop count: we might
    skip some sections, or check sections redundantly, if nr_dirty was changed in
    the meantime.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
    The mount manual describes remount as below:

    "mount -o remount,rw /dev/foo /dir
    After this call all old mount options are replaced and arbitrary stuff from
    fstab is ignored, except the loop= option which is internally generated and
    maintained by the mount command."

    Previously f2fs did not clear up old mount options on remount_fs, so we had no
    chance to disable a previously set option (e.g. flush_merge). Fix it.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
    Punching a hole in a directory is not supported in f2fs, so let's limit the
    file type in punch_hole().

    In addition, in punch_hole, if the offset exceeds the file size, we should
    skip punching the hole.
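
    From userspace, punching a hole goes through fallocate(2); a minimal sketch:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>  /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */

    /* Deallocate [off, off + len) of a regular file; i_size is unchanged
     * because PUNCH_HOLE must be combined with KEEP_SIZE. */
    int punch(int fd, off_t off, off_t len)
    {
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         off, len);
    }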

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
    The block size in f2fs is 4096 bytes, so theoretically f2fs can support a
    4096-byte sector device at maximum. But f2fs previously supported only
    512-byte sectors, so a block device such as zRAM, which uses the page cache
    as its block storage space, could not be mounted because of the mismatch
    between the sector size of zRAM and the sector size f2fs supported.

    With this patch we support large sector sizes in f2fs, so block devices with
    sector sizes of 512/1024/2048/4096 bytes can be supported.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
    By using FALLOC_FL_KEEP_SIZE in ->fallocate of f2fs, we can fallocate blocks
    past EOF without changing the i_size of the inode. These blocks past EOF will
    not be truncated in ->setattr, as we truncate them only when changing the file
    size.

    We should give setattr() a chance to truncate blocks beyond the file size.
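
    For reference, this is the preallocation mode in question, sketched from
    userspace:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>  /* FALLOC_FL_KEEP_SIZE */

    /* Preallocate len bytes at off without growing i_size; the blocks past
     * EOF stay allocated until the file size itself changes. */
    int prealloc_keep_size(int fd, off_t off, off_t len)
    {
        return fallocate(fd, FALLOC_FL_KEEP_SIZE, off, len);
    }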

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
    f2fs_direct_IO uses __allocate_data_block, and inside the allocation path we
    should update i_size at the time it changes so that its inode page is updated
    as well. Otherwise, we can get a wrong i_size after roll-forward recovery.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch cleans up a simple macro.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    If the same data is updated multiple times, we don't need to redo the whole
    set of operations.
    Let's just update the latest one.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    In f2fs_sync_file, if there are no appended writes, it skips writing its node
    blocks.
    But if there is an up-to-date inode page, we should write it to update its
    metadata during roll-forward recovery.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    We can summarize the roll-forward recovery scenarios as follows.

    [Term] F: fsync_mark, D: dentry_mark

    1. inode(x) | CP | inode(x) | dnode(F)
    -> Update the latest inode(x).

    2. inode(x) | CP | inode(F) | dnode(F)
    -> No problem.

    3. inode(x) | CP | dnode(F) | inode(x)
    -> Recover to the latest dnode(F), and drop the last inode(x)

    4. inode(x) | CP | dnode(F) | inode(F)
    -> No problem.

    5. CP | inode(x) | dnode(F)
    -> The inode(DF) was missing. Should drop this dnode(F).

    6. CP | inode(DF) | dnode(F)
    -> No problem.

    7. CP | dnode(F) | inode(DF)
    -> If f2fs_iget fails, then goto next to find inode(DF).

    8. CP | dnode(F) | inode(x)
    -> If f2fs_iget fails, then goto next to find inode(DF).
    But it will fail due to no inode(DF).

    So, this patch adds some missing points such as #1, #5, #7, and #8.

    Signed-off-by: Huang Ying
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    This patch revisits all the recovery information handled during
    f2fs_sync_file.

    In this patch, three pieces of information drive the decision:

    a) IS_CHECKPOINTED, /* is it checkpointed before? */
    b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */
    c) HAS_LAST_FSYNC, /* has the latest node fsync mark? */

    And, the scenarios for our rule are based on:

    [Term] F: fsync_mark, D: dentry_mark

    1. inode(x) | CP | inode(x) | dnode(F)
    2. inode(x) | CP | inode(F) | dnode(F)
    3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
    4. inode(x) | CP | dnode(F) | inode(F)
    5. CP | inode(x) | dnode(F) | inode(DF)
    6. CP | inode(DF) | dnode(F)
    7. CP | dnode(F) | inode(DF)
    8. CP | dnode(F) | inode(x) | inode(DF)

    For example, in #3, the three conditions change as follows.

       inode(x) | CP | dnode(F) | inode(x) | inode(F)
    a)    x       o       o          o          o
    b)    x       x       x          x          o
    c)    x       o       o          x          o

    If f2fs_sync_file stops ---------^,
    it should write inode(F) -------------------^

    So need_inode_block_update should return true, since
    c) get_nat_flag(e, HAS_LAST_FSYNC) is false.

    For example, in #8:

       CP | alloc | dnode(F) | inode(x) | inode(DF)
    a)  o      x        x          x           x
    b)  x      x        x          o
    c)  o      o        x          o

    If f2fs_sync_file stops ------^,
    it should write inode(DF) ----------------^

    Note that the roll-forward policy should follow this rule, which means that
    if there are any missing blocks, we don't need to recover that inode.

    Signed-off-by: Huang Ying
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • This patch introduces a flag in the nat entry structure to merge various
    information such as checkpointed and fsync_done marks.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    Previously, all the dnode pages had to be read during roll-forward recovery.
    Even worse, the whole chain was traversed twice.
    This patch removes those redundant and costly read operations by using the
    page cache of meta_inode along with the readahead function.

    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

16 Sep, 2014

5 commits

    If the inode is the same and its data index needs to be truncated, we can
    fall into a double lock on its inode page via get_dnode_of_data.

    Error case is like this.

    1. write data 1, 2, 3, 4, 5 in inode #4.
    2. write data 100, 102, 103, 104, 105 in dnode #6 of inode #4.
    3. sync
    4. update data 100->106 in dnode #6.
    5. fsync inode #4.
    6. power-cut

    -> Then,
    1. go back to #3's checkpoint
    2. in do_recover_data, get_dnode_of_data() gets inode #4.
    3. detect 100->106 in dnode #6.
    4. check_index_in_prev_nodes tries to truncate 100 in dnode #6.
    5. to trigger truncate_hole, get_dnode_of_data should grab inode #4.
    6. detect *kernel hang*

    This patch should resolve that bug.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    The nm_i->fcnt check is executed before taking the spin_lock, so if another
    thread deletes the last free_nid from the list, a wrong nid may be returned.
    Fix the race condition by moving the nm_i->fcnt check inside the spin_lock.
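
    The fix pattern, shown as a hedged userspace analogue (a pthread mutex
    standing in for the kernel spinlock; the names are illustrative):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t free_nid_lock = PTHREAD_MUTEX_INITIALIZER;
    static int fcnt;                /* number of cached free nids */
    static unsigned int next_free;  /* stand-in for the free_nid list */

    /* The emptiness check happens under the same lock that protects the
     * list, so the check and the removal form one critical section. */
    bool alloc_nid(unsigned int *nid)
    {
        bool ok = false;

        pthread_mutex_lock(&free_nid_lock);
        if (fcnt > 0) {             /* check moved inside the lock */
            *nid = next_free++;     /* take one entry */
            fcnt--;
            ok = true;
        }
        pthread_mutex_unlock(&free_nid_lock);
        return ok;
    }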

    Signed-off-by: Huang, Ying
    Signed-off-by: Jaegeuk Kim

    Huang Ying
     
    Now, if there is no free nid in nm_i->free_nid_list, 0 may be saved into
    next_free_nid of the checkpoint, which may cause useless scanning at the next
    mount. nm_i->next_scan_nid should be a better default value than 0.

    Signed-off-by: Huang, Ying
    Signed-off-by: Jaegeuk Kim

    Huang Ying
     
    If the user writes F2FS_IPU_FSYNC:4 to /sys/fs/f2fs/ipu_policy, only
    f2fs_sync_file starts to try in-place updates.
    And if the number of dirty pages is over /sys/fs/f2fs/min_fsync_blocks, it
    keeps doing out-of-place updates. Otherwise, it triggers in-place updates.

    This may be used by storage showing very high random write performance.

    For example, it can be used when

    Seq. writes (Data) + wait + Seq. writes (Node)

    is much slower than

    Rand. writes (Data)

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    Previously f2fs only counted dirty dentry pages, but there is no reason not
    to expand the scope.

    This patch renames the dirty page management fields and counts dirty pages
    in each inode info as well.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

11 Sep, 2014

1 commit


10 Sep, 2014

9 commits

    If an application passes a negative offset to lseek with SEEK_DATA|SEEK_HOLE,
    f2fs previously hit a BUG_ON in get_dnode_of_data, as reported by Tommi
    Rantala.

    He could reproduce it with this simple call:
    lseek(fd, -17595150933902LL, SEEK_DATA);

    This patch resolves that bug.

    Reported-by: Tommi Rantala
    [Jaegeuk Kim: relocate the condition as suggested by Chao]
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    In gc_node_segment, if node page GC runs concurrently with node page
    writeback, and check_valid_map and get_node_page run after the page is
    locked but before cur_valid_map is updated, as below, the page can be
    written twice unnecessarily.

    GC thread (gc_node_segment)    writeback thread
                                   sync_node_pages
                                     try_lock_page
                                     ...
    check_valid_map                  f2fs_write_node_page
    ...                                write_node_page
                                         do_write_page
                                           allocate_data_block
                                             ...
                                             refresh_sit_entry /* update cur_valid_map */
                                     ...
                                     unlock_page
    get_node_page
    ...
    set_page_dirty
    ...
    f2fs_put_page
      unlock_page

    This can be solved by calling check_valid_map again after get_node_page.

    Signed-off-by: Huang, Ying
    Signed-off-by: Jaegeuk Kim

    Huang Ying
     
    We use the flush cmd control to collect many flush cmds and flush them
    together. Previously, two lists were used to manage the flush cmds (collect
    and dispatch), with one spin lock protecting them.
    In fact, the lock-less list (llist) is very well suited to this case, so we
    use it to simplify the routine.

    v2:
    - use llist_for_each_entry_safe to fix a possible use-after-free issue.
    - remove the unused field from struct flush_cmd.
    Thanks to Yu for the suggestion.
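
    A hedged kernel-style sketch of the resulting scheme (the details are
    illustrative; only the llist calls are the real API):

    #include <linux/completion.h>
    #include <linux/llist.h>

    struct flush_cmd {
        struct completion wait;
        struct llist_node llnode;
        int ret;
    };

    static LLIST_HEAD(issue_list);

    /* issuer side: queue a command without taking any spinlock */
    static void queue_flush(struct flush_cmd *cmd)
    {
        init_completion(&cmd->wait);
        llist_add(&cmd->llnode, &issue_list);
        /* wake the flush thread, then wait_for_completion(&cmd->wait) */
    }

    /* flusher side: detach all pending commands in one atomic operation,
     * issue a single device flush, then complete every waiter */
    static void dispatch_flush(int ret)
    {
        struct llist_node *list = llist_del_all(&issue_list);
        struct flush_cmd *cmd, *next;

        llist_for_each_entry_safe(cmd, next, list, llnode) {
            cmd->ret = ret;
            complete(&cmd->wait);
        }
    }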

    Signed-off-by: Gu Zheng
    Signed-off-by: Jaegeuk Kim

    Gu Zheng
     
    In commit aec71382c681 ("f2fs: refactor flush_nat_entries codes for reducing
    NAT writes"), we described the issue as below:

    "Although building the NAT journal in cursum reduces the read/write work for
    the NAT block, the previous design leaves us with lower performance when
    writing checkpoints frequently, for these cases:
    1. if the journal in cursum is already full, it's a bit of a waste that we
    flush all nat entries to pages for persistence, but cache no entries.
    2. if the journal in cursum is not full, we fill nat entries into the journal
    until the journal is full, then flush the remaining dirty entries to disk
    without merging the journaled entries, so these journaled entries may be
    flushed to disk at the next checkpoint but lost the chance to be flushed last
    time."

    Actually, we have the same problem in using the SIT journal area.

    In this patch, we first update the sit journal with as many dirty entries as
    possible. Second, if there is no space left in the sit journal, we remove all
    entries from the journal and walk through the whole dirty entry bitmap of
    sit, accounting each dirty sit entry located in the same SIT block to a sit
    entry set. All entry sets are linked into the sit_entry_set list in sm_info,
    sorted in ascending order by the number of entries in the set. Later we flush
    the sets with the fewest entries into the journal, as many as we can, and
    then flush the dense sets with merged entries to disk.

    In this way we can use the sit journal area more effectively, and we reduce
    SIT updates, resulting in a performance gain and a longer lifetime for the
    flash device.

    In my testing environment, this patch clearly reduces SIT block updates.

    virtual machine + hard disk:
    fsstress -p 20 -n 400 -l 5

                sit page num   cp count   sit pages/cp
    based       2006.50        1349.75    1.486
    patched     1566.25        1463.25    1.070

    The latency of the merge operation is small even when handling a great
    number of dirty SIT entries in flush_sit_entries:

    latency(ns)   dirty sit count
    36038         2151
    49168         2123
    37174         2232

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
    sit_i in the macros SIT_BLOCK_OFFSET/START_SEGNO is not used; remove it.
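
    A hedged before/after sketch of this kind of cleanup (the exact definitions
    may differ):

    /* before: sit_i is accepted but never referenced */
    #define SIT_BLOCK_OFFSET(sit_i, segno)  \
        ((segno) / SIT_ENTRY_PER_BLOCK)

    /* after: the dead parameter is dropped */
    #define SIT_BLOCK_OFFSET(segno)  ((segno) / SIT_ENTRY_PER_BLOCK)
    #define START_SEGNO(segno)       (SIT_BLOCK_OFFSET(segno) * SIT_ENTRY_PER_BLOCK)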

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Chao Yu
     
    If roll-forward recovery fails, we'd better run fsck.f2fs.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    This patch adds handling of buggy corner cases for fsck.f2fs.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
    This patch replaces BUG cases with f2fs_bug_on to retain information for
    fsck.f2fs. And it implements some void functions to initiate fsck.f2fs too.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     
  • If any f2fs_bug_on is triggered, fsck.f2fs is needed.

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim